txindex: hash keys and pack positions to reduce disk usage #35531

andrewtoth commented at 10:33 PM on June 14, 2026: contributor

The current txindex uses the full 32-byte txid as keys, which takes up about 66 GB of disk space today on mainnet. Using a 5-byte key prefix instead drops the disk usage to 26 GB - cutting the size to less than half.

Using the full 32 bytes is unnecessary, since any two given txids share the same 5-byte salted siphash with a probability of only about 1 in 1.1 trillion. Some collisions will occur, but the penalty is just an extra disk read, deserialization and hash. The tx position can be appended to the key instead of used as a value, and a LevelDB iterator can seek to the prefix and then scan for the correct tx. This is an almost identical approach to txospenderindex.
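
A minimal sketch of the seek-and-scan idea, assuming an ordered `std::set` as a stand-in for the LevelDB keyspace. The names `Prefix`, `Key`, `KeySpace` and `FindPosition` are hypothetical and not taken from the PR; the `matches` callback stands in for reading the transaction at a position and comparing its txid.

```cpp
#include <array>
#include <cstdint>
#include <optional>
#include <set>
#include <utility>

// Hypothetical stand-ins: a 5-byte hash prefix and an opaque position.
using Prefix = std::array<uint8_t, 5>;
// (prefix, position) pairs compare lexicographically, prefix first.
using Key = std::pair<Prefix, uint64_t>;
// Keys only, no values: the position lives in the key itself.
using KeySpace = std::set<Key>;

// Seek to the first key with this prefix, then scan while the prefix matches.
// `matches(pos)` is assumed to load the tx at `pos` and check its txid.
template <typename Pred>
std::optional<uint64_t> FindPosition(const KeySpace& db, const Prefix& prefix, Pred matches)
{
    for (auto it = db.lower_bound(Key{prefix, 0}); it != db.end() && it->first == prefix; ++it) {
        if (matches(it->second)) return it->second;
    }
    return std::nullopt;
}
```

Because colliding entries sit next to each other in key order, a collision only costs one more iteration of the loop and one more call to `matches`.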

Also, instead of storing the file position of the block, we can store only the sequence of the connected block and the offset of the transaction in the block. This can be packed into a 6-byte key suffix using 3-byte representations of the sequence and the offset in the block. The block file can be recovered from the CBlockIndex that is already in memory. The sequence is mapped to the block hash in the db, so we can look up the block hash to find the CBlockIndex during reads.
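
A sketch of how two 3-byte fields could be packed into a 6-byte suffix. The struct and function names are hypothetical, and the big-endian byte order is an assumption made here so that byte-wise key order matches numeric order; the PR may encode this differently.

```cpp
#include <array>
#include <cstdint>
#include <stdexcept>

// Hypothetical position record; both fields are assumed to fit in 24 bits.
struct Position {
    uint32_t sequence;  // order in which the block was connected
    uint32_t tx_offset; // byte offset of the transaction inside the block
};

constexpr uint32_t MAX_24BIT{0xFFFFFF};

// Pack both fields big-endian into 6 bytes (assumed layout).
std::array<uint8_t, 6> PackPosition(const Position& pos)
{
    if (pos.sequence > MAX_24BIT || pos.tx_offset > MAX_24BIT) {
        throw std::out_of_range("value does not fit in 3 bytes");
    }
    return {
        static_cast<uint8_t>(pos.sequence >> 16),
        static_cast<uint8_t>(pos.sequence >> 8),
        static_cast<uint8_t>(pos.sequence),
        static_cast<uint8_t>(pos.tx_offset >> 16),
        static_cast<uint8_t>(pos.tx_offset >> 8),
        static_cast<uint8_t>(pos.tx_offset),
    };
}

// Inverse of PackPosition.
Position UnpackPosition(const std::array<uint8_t, 6>& b)
{
    return Position{
        (uint32_t{b[0]} << 16) | (uint32_t{b[1]} << 8) | uint32_t{b[2]},
        (uint32_t{b[3]} << 16) | (uint32_t{b[4]} << 8) | uint32_t{b[5]},
    };
}
```

Three bytes cover values up to 16,777,215, which is above a 4,000,000-byte serialized block, so any in-block offset fits.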

If a tx is not found with this method, we fall back to looking up the legacy entry. With this method a user with an existing db can opt to erase the indexes/txindex folder and reindex, or keep the current index and new entries will be appended with the smaller footprint.

The time to index was faster on my machine with this method, 1h19m vs current 1h50m. Lookups are roughly the same, around 0.2ms per lookup with getrawtransaction. When testing on mainnet, I got ~860k 2-way collisions, and 1 3-way collision that worst case could cause an extra 2 false positives when reading.

DrahtBot commented at 10:33 PM on June 14, 2026: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/35531.

Reviews

See the guideline and AI policy for information on the review process.

Type	Reviewers
Concept ACK	optout21, theStack, sedited, arejula27
Stale ACK	l0rinc

If your review is incorrectly listed, please copy-paste <code></code> into the comment that the bot should ignore.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#35728 (rpc: Properly throw on internal I/O errors in GetTransaction by maflcko)
#35713 (Remove boost as a unit test runner by rustaceanrob)
#34729 (Reduce log noise by ajtowns)
#34132 (coins: drop error catcher, centralize fatal read handling by l0rinc)
#33324 (blocks: add resumable reobfuscation for existing block files by l0rinc)
#24230 (indexes: Stop using node internal types and locking cs_main, improve sync logic by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental)

Possible typos and grammar issues:

must be now be preferred -> must now be preferred [extra “be” breaks the sentence]

2026-07-29 22:16:15

andrewtoth renamed this:
~~txindex: use siphash keys to optimize disk usage~~
txindex: use 8-byte siphash keys to optimize disk usage
on Jun 14, 2026

sipa commented at 10:43 PM on June 14, 2026: member

With 1.376e9 transactions in total, the chance of having at least one collision if each is given a 64-bit random identifier, is around 5%.
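
The estimate can be reproduced with the usual birthday approximation, 1 - exp(-n^2 / 2^(bits+1)). This is a sketch that assumes uniformly random identifiers; the function name is hypothetical.

```cpp
#include <cmath>

// Approximate probability of at least one collision among n uniformly random
// identifiers of `bits` bits each: 1 - exp(-n^2 / 2^(bits + 1)).
double CollisionProbability(double n, int bits)
{
    // ldexp(2.0, bits) == 2^(bits + 1); -expm1(-x) == 1 - exp(-x) without
    // losing precision for small x.
    return -std::expm1(-(n * n) / std::ldexp(2.0, bits));
}
```

Calling `CollisionProbability(1.376e9, 64)` gives a value close to 0.05.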

andrewtoth commented at 10:46 PM on June 14, 2026: contributor

@sipa Yes, apologies if I was not clear in the PR description, but collisions are handled with this change. When testing I had 2 sets of 2 txs that collided.

sipa commented at 10:51 PM on June 14, 2026: member

Oh, I see! Sorry, I saw the number of 1 in 18.4 quintillion and jumped to conclusions.

Neat. You're essentially treating the database as a set of (txid siphash, tx data) pairs, rather than as a (txid siphash) -> (tx data) map, so it functions as a multimap instead.

andrewtoth commented at 10:55 PM on June 14, 2026: contributor

Thanks! Yes, collisions are rare enough that this will not have a noticeable read penalty. A collision may incur an extra disk read, deserialization and sha256 hash.

sipa commented at 11:53 PM on June 14, 2026: member

If existence of collisions isn't the criterion to judge this by, but the expected read amplification from those collisions, we can possibly go even lower?

With 5-byte salted hashes, the expected amplification factor is under 1.002 per read, up to 2 billion txids, and would save a few extra gigabytes. That might even be a bigger speed win than the loss from that amplification. 4-byte hashes would give an amplification that's probably too much.
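
As a rough check of these figures, assuming uniformly distributed prefixes, a lookup touches its own entry plus about n / 2^(8 * prefix_bytes) accidental matches. The function name below is hypothetical.

```cpp
#include <cmath>

// Expected number of index entries sharing a prefix with the one looked up:
// 1 for the real entry plus roughly n / 2^(8 * prefix_bytes) accidental matches.
double ExpectedAmplification(double n_txids, int prefix_bytes)
{
    return 1.0 + n_txids / std::ldexp(1.0, 8 * prefix_bytes);
}
```

With 2 billion txids this stays just under 1.002 for 5 bytes, while 4 bytes gives a factor of about 1.47.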

andrewtoth force-pushed on Jun 15, 2026

andrewtoth renamed this:
~~txindex: use 8-byte siphash keys to optimize disk usage~~
txindex: use 5-byte siphash keys to optimize disk usage
on Jun 15, 2026

andrewtoth commented at 2:30 AM on June 15, 2026: contributor

Updated to use 5-byte siphash. The reindex was faster by 8 minutes, and the db size is now 32 GB. I got 860k 2-way collisions now though, instead of 2 before, and I got 1 3-way collision. The read penalty was not really measurable from my machine though, likely because I am using a fast laptop with fast directly connected storage.
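
The 2-way count is in line with what random 40-bit values would give. A sketch, assuming uniformly random prefixes and the 1.376e9 transaction count quoted earlier in the thread (the function name is hypothetical):

```cpp
#include <cmath>

// Expected number of colliding pairs among n uniformly random `bits`-bit
// values: C(n, 2) / 2^bits.
double ExpectedCollidingPairs(double n, int bits)
{
    return n * (n - 1.0) / 2.0 / std::ldexp(1.0, bits);
}
```

`ExpectedCollidingPairs(1.376e9, 40)` comes out at roughly 860 thousand pairs.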

l0rinc commented at 10:40 AM on June 15, 2026: contributor

Concept ACK, will play with this after we're done with the compactions

in src/index/txindex.cpp:53 in e3f35ee417

  48 | +    return TxHashKeyPrefix{
  49 | +        static_cast<uint8_t>(siphash >> 56),
  50 | +        static_cast<uint8_t>(siphash >> 48),
  51 | +        static_cast<uint8_t>(siphash >> 40),
  52 | +        static_cast<uint8_t>(siphash >> 32),
  53 | +        static_cast<uint8_t>(siphash >> 24),

optout21 commented at 11:05 AM on June 15, 2026:

e3f35ee txindex: use 5-byte siphash keys to optimize disk usage:

Can the const offsets be given in hex? Structure shows better through the hex numbers 18, 20, 28, 30, 38.

optout21 commented at 11:08 AM on June 15, 2026:

e3f35ee txindex: use 5-byte siphash keys to optimize disk usage:

I may be shooting in the dark, but could it be that doing the right shifts incrementally (i.e., first by 24, then 4 times by 8; placed in reverse order), is slightly more efficient? (shifts with smaller size; a micro-optimization).

andrewtoth commented at 1:03 AM on June 16, 2026:

Done.

andrewtoth commented at 1:04 AM on June 16, 2026:

Hmm not sure that could really make a difference that would be visible to a user or to benchmarks here?

optout21 commented at 11:55 AM on June 16, 2026:

I was able to measure only minimal differences in performance. I microbenchmarked only the shift operations. The result was 3% speedup, or 0.000486 microsec per iteration (1518827 microsec vs 1470169, for 100000000 iterations). I think this is negligible.

Please set this thread to Resolved.

The two versions compared were:

            buf[0] = static_cast<uint8_t>(siphash >> 0x38);
            buf[1] = static_cast<uint8_t>(siphash >> 0x30);
            buf[2] = static_cast<uint8_t>(siphash >> 0x28);
            buf[3] = static_cast<uint8_t>(siphash >> 0x20);
            buf[4] = static_cast<uint8_t>(siphash >> 0x18);

            siphash >>= 24;
            buf[4] = static_cast<uint8_t>(siphash);
            siphash >>= 8;
            buf[3] = static_cast<uint8_t>(siphash);
            siphash >>= 8;
            buf[2] = static_cast<uint8_t>(siphash);
            siphash >>= 8;
            buf[1] = static_cast<uint8_t>(siphash);
            siphash >>= 8;
            buf[0] = static_cast<uint8_t>(siphash);
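
For reference, a self-contained sketch wrapping both variants in functions so their output can be compared directly. The function names and the `Prefix` alias are hypothetical; the shift logic is the same as in the two snippets above.

```cpp
#include <array>
#include <cstdint>

using Prefix = std::array<uint8_t, 5>;

// Variant 1: five independent shifts of the original value.
Prefix TopBytesDirect(uint64_t siphash)
{
    return {
        static_cast<uint8_t>(siphash >> 0x38),
        static_cast<uint8_t>(siphash >> 0x30),
        static_cast<uint8_t>(siphash >> 0x28),
        static_cast<uint8_t>(siphash >> 0x20),
        static_cast<uint8_t>(siphash >> 0x18),
    };
}

// Variant 2: one shift by 24, then four incremental shifts by 8,
// filling the buffer from the last byte to the first.
Prefix TopBytesIncremental(uint64_t siphash)
{
    Prefix buf{};
    siphash >>= 24;
    buf[4] = static_cast<uint8_t>(siphash);
    siphash >>= 8;
    buf[3] = static_cast<uint8_t>(siphash);
    siphash >>= 8;
    buf[2] = static_cast<uint8_t>(siphash);
    siphash >>= 8;
    buf[1] = static_cast<uint8_t>(siphash);
    siphash >>= 8;
    buf[0] = static_cast<uint8_t>(siphash);
    return buf;
}
```

Both return the five most significant bytes of the 64-bit hash in big-endian order.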

sipa commented at 12:12 PM on June 16, 2026:

If the CPU performance of constructing the hash is at all relevant (I don't know), we could consider a faster hash function (like #35215).

andrewtoth commented at 12:36 PM on June 16, 2026:

we could consider a faster hash function (like #35215).

The unfortunate thing about this use case is that we would have to decide on this before a release, since changing the hash function after the fact would require a reindex.

sipa commented at 1:26 PM on June 16, 2026:

Indeed, changing it is painful once released.

Just to assess whether that's worth investigating at all, would someone benchmark with and without the simplified siphash there?

andrewtoth commented at 1:41 PM on June 16, 2026:

I was looking to do that :). But, the hash function there is optimized for COutPoint, which has an extra 4 bytes after the uint256. @l0rinc is that sufficiently faster if we just pass a constant as the extra, or is there a more optimized version we can run that omits the extra field?

optout21 commented at 1:48 PM on June 16, 2026:

Yes, #35215 was an optimization for the extra 4-byte vout, which is not the case here. A speedup could be obtained with a hash that internally works with 5-byte-only values, but on 64-bit architecture that's not really faster than operations with 8-byte values, so I don't see an easy win here.

What could be measured as a boundary data point is just taking 5 bytes of the TXID without any extra hashing/salting, and if the speedup is significant, considered.

sipa commented at 1:50 PM on June 16, 2026:

I think you have it backwards, @optout21.

We're discussing replacing the SipHash function used to compute the 5-byte value; it's not using it as an input.

The input we need here is the txid. In #35215 the input is a COutPoint.

l0rinc commented at 1:52 PM on June 16, 2026:

@l0rinc is that sufficiently faster if we just pass a constant as the extra, or is there a more optimized version we can run that omits the extra field?

We can add a version without the extra, of course. I will help with benchmarking this after the compaction work is behind us - unless you think this is more urgent for some reason.

andrewtoth commented at 1:53 PM on June 16, 2026:

A tangent but this now got me thinking, if it works well for txindex, we could extend this method to chainstate. siphash the CoinEntry as key prefix, append the Coin to the key, and seek to the prefix and scan for the right outpoint.

sipa commented at 1:54 PM on June 16, 2026:

@andrewtoth I think taking the same design approach into account, we can drop 1 extra SipHash round from the construction in #35215 if there is no extra uint32_t to add, so 4 instead of 5 rounds (compared to 14 rounds with traditional SipHash-2-4).
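
The 14-round figure for traditional SipHash-2-4 follows from the textbook structure: c rounds per 8-byte word, including one final word that carries the length byte, plus d finalization rounds. A sketch of that count (the function name is hypothetical). The custom variant in #35215 changes the block structure, so its 4 and 5 round counts are not reproduced by this formula.

```cpp
#include <cstddef>

// Rounds performed by textbook SipHash-c-d on `len` input bytes:
// (len / 8 + 1) words are compressed with c rounds each, where the extra
// word holds the remaining bytes and the length byte, then d rounds finalize.
constexpr std::size_t SipHashRounds(std::size_t c, std::size_t d, std::size_t len)
{
    return c * (len / 8 + 1) + d;
}

// A 32-byte txid under SipHash-2-4: 2 * 5 + 4 rounds.
static_assert(SipHashRounds(2, 4, 32) == 14);
```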

l0rinc commented at 1:56 PM on June 16, 2026:

Yes, that's what I mean. I don't mind adding it to #35215 if you think it's a good idea.

andrewtoth commented at 2:06 PM on June 16, 2026:

seek to the prefix and scan for the right outpoint.

Actually that won't work, because we don't have the outpoint we can reconstruct like we can the txid from the transaction we read.

optout21 commented at 3:48 PM on June 16, 2026:

We're discussing replacing the SipHash function used to compute the 5-byte value

Yes, I didn't mean otherwise. My point was (maybe not clearly expressed) that internally SipHash works with 8-byte values, which is a waste, if in the end only a 5-byte hash is needed. A custom version working with 5-byte values is conceivable, but since 64-bit CPUs are optimized for 64 bit width, it probably wouldn't be faster.

My other point was that to get an upper bound on the speedup possible through tweaking the hash, it's possible to measure an oversimplified solution, where no hash is used at all, but 5 bytes are taken directly from the 32-byte TXID (which is itself a hash). I'm not saying that this would be an acceptable solution (though it might), but it would be faster than any optimized hash (to get the 5 bytes).

sipa commented at 4:14 PM on June 16, 2026:

@optout21 I see what you mean now, but that is not what we're talking about.

SipHash, and all variants of it, internally work with 64-bit values, that's not going to change. It would be an entirely distinct hash function, which needs separate analysis, to change that. All designs just truncate the final 64-bit value to 40-bit.

What is being discussed now is independent from the 8/5 byte question. The idea is that, since this PR bites the bullet in changing the data layout to be hash-based, we might as well pick an efficient hash function. Right now, this PR uses normal SipHash-2-4. @l0rinc's PR linked above introduces a more experimental (and custom) variant of SipHash-1-3 with jumboblocks and without padding. Using that same variant here would mean a construction that only needs 4 SipHash rounds (all operating on four internal 64-bit values) rather than 14 SipHash rounds. This same change would be possible even if this PR used 8-byte ids.

l0rinc commented at 10:14 PM on June 16, 2026:

See if you can use https://github.com/bitcoin/bitcoin/pull/35215/changes/4d00740e2921c09a717bcf1964b94780a64757bb#diff-66e9563fc5032b8ac0ab910034faadf65cbbb282bee6c46d5c92134cb86c66fdR116-R137 here

andrewtoth commented at 11:33 PM on June 18, 2026:

Ran some benchmarks, 2 runs each of master, siphash24, jumbo siphash13, and just taking the first 5 bytes (unsafe, but the best we can expect to get).

Siphash24 is 13 minutes faster than master, and siphash13 is another 90 seconds faster than that. The raw first 5 byte variant was only 11s faster than siphash13, so we're very close to the theoretical limit.

I have a very fast CPU, so a slower machine might show a bigger speedup with siphash13. An extra 90 seconds doesn't seem like a deal breaker to me, though. If we can get the siphash13 in before this is released, we should definitely take it.

| Variant | min sync | avg sync | max sync | avg (h:m:s) | size (GiB) | vs master |
|---|---|---|---|---|---|---|
| master (full 32B txid) | 6286s | 6355s | 6424s | 1h45m55s | 65.85 | baseline |
| normal (5B SipHash-2-4) | 5550s | 5578s | 5607s | 1h32m58s | 32.18 | −12.2% time |
| jumbo (5B SipHash-1-3) | 5437s | 5488s | 5538s | 1h31m27s | 32.07 | −13.7% time |
| raw5 (first 5B of txid) | 5417s | 5476s | 5536s | 1h31m16s | 32.04 | −13.8% time |

andrewtoth commented at 12:28 AM on July 26, 2026:

Updated to use the new siphash :rocket:

optout21 commented at 11:37 AM on June 15, 2026: contributor

Concept ACK (e3f35ee4171875124104b404464151f9b3da1566)

The reduction of index size is very positive!

The approach for graceful DB update is interesting, it's nice that reindex is not forced.

Is there a benchmark that exposes the effect of this change?

Out of curiosity, what was the number of bytes used before switching to 5? 8? And the resulting size? (unfortunately the original version & values were not preserved in the description).

andrewtoth force-pushed on Jun 16, 2026

andrewtoth commented at 1:11 AM on June 16, 2026: contributor

@optout21 the initial run was using an 8-byte siphash. The db was 36 GB and it took 1h40m.

Is there a benchmark that exposes the effect of this change?

The goal was to shrink the db size on disk, but of course if performance was negatively affected it might not be worth it. There are two relevant metrics - syncing the index and reading entries. It seems the former gets a nice bump from this as well. I am assuming it's mostly from the fact that we write a lot less data to disk.

Benchmarking the sync speed can be done by deleting the /indexes/txindex directory in the datadir, then restarting and waiting for the "txindex is enabled at height" log line. Its timestamp can be compared to that of the "txindex thread start" log line to get the delta.

Benchmarking the read speed can be done with the following script using apache bench:

TXID="<txid>"
printf '{"jsonrpc":"1.0","id":"ab","method":"getrawtransaction","params":["%s",0]}\n' "$TXID" > /tmp/data.json
ab -n 10000 -c 1 -k -A user:password -p /tmp/data.json -T application/json http://127.0.0.1:8332/

andrewtoth force-pushed on Jun 16, 2026

DrahtBot added the label CI failed on Jun 16, 2026

DrahtBot commented at 1:19 AM on June 16, 2026: contributor

🚧 At least one of the CI tasks failed. Task iwyu: https://github.com/bitcoin/bitcoin/actions/runs/27586700495/job/81558529020 LLM reason (✨ experimental): CI failed because the IWYU (include-what-you-use) check detected missing includes and forced a formatting/patch to src/index/txindex.cpp, causing the job to exit non-zero.

<details><summary>Hints</summary>

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

</details>

andrewtoth force-pushed on Jun 16, 2026

DrahtBot removed the label CI failed on Jun 16, 2026

theStack commented at 2:57 PM on June 16, 2026: contributor

Concept ACK

mzumsande commented at 5:18 PM on June 16, 2026: contributor

In case of a downgrade, the behavior is not ideal. Nothing will break and there will be no warnings or errors, but some transactions won't be returned after getrawtransaction queries even though they exist, while others will be returned normally - depending on which version indexed them.

So I think that this would need to be documented well. Alternatively, an upgrade similar to how it was done with coinstatsindex in 30.0 might be cleaner - especially since the txindex syncs very fast in comparison.

andrewtoth commented at 10:11 PM on June 16, 2026: contributor

@mzumsande good observation.

We could make ReadBestBlock/WriteBestBlock virtual, and override them in txindex to use a new locator, say Bv2, that is used to track the latest block. If Bv2 isn't found, look up B and start from there. That way a downgrade will ignore newer hashed entries and resync from where the latest legacy entries are.
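
A sketch of that fallback, using an in-memory map as a hypothetical stand-in for the index database. Only the key names `B` and `Bv2` come from the suggestion above; the free functions and the string-valued locator are simplifications for illustration, not the actual BaseIndex interface.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical in-memory stand-in for the index database.
using Db = std::map<std::string, std::string>;

const std::string LEGACY_BEST_BLOCK_KEY{"B"};
const std::string BEST_BLOCK_V2_KEY{"Bv2"};

// Prefer the new locator; fall back to the legacy one if it was never written.
std::optional<std::string> ReadBestBlock(const Db& db)
{
    if (auto it = db.find(BEST_BLOCK_V2_KEY); it != db.end()) return it->second;
    if (auto it = db.find(LEGACY_BEST_BLOCK_KEY); it != db.end()) return it->second;
    return std::nullopt;
}

// New code only advances the new locator; the legacy one stays where the
// last legacy entry was written, which is where an older node would resume.
void WriteBestBlock(Db& db, const std::string& locator)
{
    db[BEST_BLOCK_V2_KEY] = locator;
}
```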

in src/index/txindex_key.h:19 in 94d7377eda

  14 | +#include <cstddef>
  15 | +#include <cstdint>
  16 | +#include <ios>
  17 | +
  18 | +namespace txindex {
  19 | +constexpr uint8_t DB_TXINDEX_HASHED{'T'};

optout21 commented at 9:22 AM on June 17, 2026:

94d7377 txindex: use 5-byte siphash keys to optimize disk usage:

Nit: The 'T' could be confused with the legacy 't' when discussing or debugging in a mixed legacy/new DB environment. Maybe a different letter could be picked to reduce the risk of confusion.

andrewtoth commented at 12:19 AM on June 19, 2026:

Updated to 'x'.

optout21 commented at 9:25 AM on June 17, 2026: contributor

LGTM! Reviewed code, tested lightly locally, including reindex (upgrade and downgrade). Not ack'ing now, as I can see the pending points:

Optimized hash from #35215. This is not a must, can do without (but if it lands earlier, it should be taken; if not, can be done later.)
Discussion about downgrade scenario.

sedited commented at 9:51 AM on June 17, 2026: contributor

Concept ACK

As for the fallback, I'm curious how much slower this makes transaction querying. Do you think a forced migration to the new DB would be too expensive?

andrewtoth commented at 2:13 PM on June 18, 2026: contributor

@sedited I don't think the new queries are noticeably slower. It's one more read of a non-existent entry.

We could do a migration like coinstatsindex in v30, where we keep the old db and reindex in a new directory as @mzumsande suggested. I assumed it would be better to have a graceful upgrade, but I suppose every user will want to reindex to reap the disk savings. In that case, we can write release notes to tell users to wipe their old txindex directory if they are not planning on downgrading? This path will also make the code changes simpler since we don't have to support upgrade and downgrade.

What does everyone think - support graceful upgrade/downgrade, or just index into a new directory?

l0rinc commented at 2:21 PM on June 18, 2026: contributor

What does everyone think - support graceful upgrade/downgrade, or just index into a new directory?

Wouldn't on-demand migration provide the possibility to do both, as you mentioned? Start immediately and the system will fall back to amortized O(1) migration, or delete everything and do an O(n) migration? (note that I still haven't reviewed it in detail, only responding to the question)

mzumsande commented at 2:36 PM on June 18, 2026: contributor

What does everyone think - support graceful upgrade/downgrade, or just index into a new directory?

The exact same procedure would probably not be a good idea - if we'd keep the old index with its 66GB around by default, users who don't do anything manually (which are probably most) would experience an increase in disk space because they would have both. That was not a problem for coinstatsindex because it is much smaller.

optout21 commented at 2:53 PM on June 18, 2026: contributor

The graceful upgrade without forced reindex is very user-friendly (at the price of hybrid data in the DB, and logic to try to read both). The downgrade being the less frequent use case, I think it's acceptable to require a reindex in that case. However, the question is how to prevent the old code from using the existing DB. Maybe at graceful upgrade rename the index DB (but keep its content), so the old code will not find it. Just an idea.

sedited commented at 7:16 PM on June 18, 2026: contributor

What does everyone think - support graceful upgrade/downgrade, or just index into a new directory?

I think what you have now is fine tbh. I asked the question before because I was curious to hear some of your reasoning.

andrewtoth force-pushed on Jun 19, 2026

DrahtBot added the label CI failed on Jun 19, 2026

DrahtBot commented at 12:29 AM on June 19, 2026: contributor

🚧 At least one of the CI tasks failed. Task No wallet: https://github.com/bitcoin/bitcoin/actions/runs/27797474238/job/82260299474 LLM reason (✨ experimental): CI failed due to a Clang -Werror compile error: deleting BaseIndex::DB with virtual functions but a non-virtual destructor (-Wdelete-non-abstract-non-virtual-dtor).

andrewtoth commented at 12:36 AM on June 19, 2026: contributor

Thanks for all your responses. Seems we're all in agreement about keeping graceful upgrades and downgrades.

I pushed a commit 736410a1739fe8fd5c7c4140929d115ace6475ee that updates the block locator for txindex, and reads the legacy block locator if the new version is not yet written. This lets us continue to use the legacy index when upgrading, and an older node will ignore hashed entries when downgrading and resync from where the legacy locator left off.

If a user wipes their txindex and resyncs with the new hashed entries, then downgrades, they will reindex legacy again and have both indexes in one db. I think that's ok though, we should just mention in the release notes to wipe the db only if you don't plan on downgrading.

DrahtBot removed the label CI failed on Jun 19, 2026

Sjors commented at 1:06 PM on June 19, 2026: member

You could further shrink the index by letting go of the requirement that -txindex also indexes the block (via CDiskTxPos.nPos). Pointing directly to the file position in the block file shaves off a few bytes. You could encode the block height, which still saves some bytes compared to the current approach, but then you don't know if a transaction is in the best chain or orphaned. This may not be desirable, so it's probably best to keep CDiskTxPos as is.

andrewtoth commented at 2:51 PM on June 19, 2026: contributor

@Sjors yeah, if we use the height we could give a wrong block hash back for an orphaned tx. Will leave as is.

One thing we could do to shave off some more GBs is remove bloom filters, a la #35568, but that will likely slow down the legacy lookups. If we didn't care about graceful upgrading we could do that.

Sjors commented at 3:06 PM on June 19, 2026: member

@andrewtoth you could conditionally disable it for newly created indexes? The release note could mention that disk space savings can be achieved by deleting the existing index.

optout21 commented at 3:46 PM on June 19, 2026: contributor

@Sjors, do you mean that the two positions could be merged? There are two uint32_t positions: nPos is the offset of the block within the file, and nTxOffset is the offset of the TX within the block. There are also two seeks to get to the correct position.

andrewtoth commented at 3:52 PM on June 19, 2026: contributor

@optout21 I believe the suggestion is to just have one position which points to the position of the transaction in the file, not the block. Then, we can also encode the height of the block in 3 bytes and look up the hash in CBlockIndex. But, the nPos and nTxOffset are both encoded as varints, so this might not actually save much space with the extra 3 bytes.

andrewtoth commented at 4:01 PM on June 19, 2026: contributor

you could conditionally disable it for newly created indexes

@Sjors good idea, we could skip bloom filters and legacy lookups if we don't see any 't' entries in the db. We would have to peek inside it before opening it though, since the bloom filters are a startup option. I think I might leave that for a follow-up to add to #35568 if this gets merged.

l0rinc commented at 7:48 PM on June 19, 2026: contributor

Q: Could this key format change consider future prune compatibility? It's probably out of scope, but maybe worth taking into account here, since if the index could identify the containing block independently of local block files, a pruned node (with the header chain) could potentially try to fetch that block on demand in a follow-up.

Sjors commented at 7:54 PM on June 19, 2026: member

@l0rinc we could, if we keep the original CDiskTxPos.nPos around for pruned blocks, we can infer which block we're missing.

andrewtoth commented at 7:57 PM on June 19, 2026: contributor

if the index could identify the containing block independently of local block files

Interesting idea. If we did store the block height instead of file position, we could do that. But, we would need to solve the problem of finding the right block if the tx requested is in a block not part of the best chain.

if we keep the original CDiskTxPos.nPos around for pruned blocks, we can infer which block we're missing.

@Sjors interesting, can you elaborate?

sipa commented at 8:08 PM on June 19, 2026: member

We could store a single number, 4000000*block_height + tx_offset.

It can even use an integer encoding without length prefix, as that can be inferred from the length of the serialized value record.

Should be 6 bytes for the foreseeable future on mainnet.

This does imply erasing entries for reorged/disconnected blocks, as otherwise we'll have nonsensical tx offsets.
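
A sketch of that encoding, assuming the in-block offset is always below 4,000,000 and using a big-endian byte string with leading zero bytes dropped (the byte order and all names here are illustrative choices, not part of the proposal). The length is recovered from the size of the stored record, so no length prefix is written.

```cpp
#include <cstdint>
#include <vector>

// Multiplier from the suggestion above; tx_offset is assumed to stay below it.
constexpr uint64_t MAX_BLOCK_BYTES{4'000'000};

// Combine height and in-block offset into a single number.
uint64_t PackHeightOffset(uint32_t height, uint32_t tx_offset)
{
    return MAX_BLOCK_BYTES * height + tx_offset;
}

// Big-endian bytes with no length prefix and no leading zero bytes
// (a value of 0 still produces one byte).
std::vector<uint8_t> EncodeMinimal(uint64_t v)
{
    std::vector<uint8_t> out;
    do {
        out.insert(out.begin(), static_cast<uint8_t>(v));
        v >>= 8;
    } while (v != 0);
    return out;
}

// The record length tells the reader how many bytes to consume.
uint64_t DecodeMinimal(const std::vector<uint8_t>& bytes)
{
    uint64_t v{0};
    for (const uint8_t b : bytes) v = (v << 8) | b;
    return v;
}
```

Height and offset are recovered with `packed / MAX_BLOCK_BYTES` and `packed % MAX_BLOCK_BYTES`. Six bytes hold values below 2^48, which covers heights up to roughly 70 million at this multiplier.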

andrewtoth commented at 8:29 PM on June 19, 2026: contributor

erasing entries for reorged/disconnected blocks

~hmm that could require us to do manual compactions on the txindex db then...~ Actually, these entries can be overwritten already on reorgs, so it doesn't change current behavior.

andrewtoth force-pushed on Jun 24, 2026

andrewtoth renamed this:
~~txindex: use 5-byte siphash keys to optimize disk usage~~
txindex: hash keys and pack positions to reduce disk usage
on Jun 24, 2026

andrewtoth force-pushed on Jun 24, 2026

DrahtBot added the label CI failed on Jun 24, 2026

DrahtBot commented at 1:07 AM on June 24, 2026: contributor

🚧 At least one of the CI tasks failed. Task iwyu: https://github.com/bitcoin/bitcoin/actions/runs/28066837312/job/83092910222 LLM reason (✨ experimental): CI failed because IWYU detected and required missing/incorrect #include fixes in src/index/txindex.cpp (triggering “Failure generated from IWYU”).

andrewtoth commented at 1:20 AM on June 24, 2026: contributor

@sipa I updated this with your suggestion to encode the block height and transaction file position using max serialized block size as a mask. Thanks!

This reduced the db to 29GB, and it synced in 1h27m :rocket:

So we must also delete entries when blocks are disconnected now. This makes the diff a little bigger to review, but in return it opens the door to allow txindex with pruning. We could fetch blocks JIT from peers when a tx is requested from a missing block. cc @l0rinc

DrahtBot removed the label CI failed on Jun 24, 2026

Sjors commented at 9:26 AM on June 24, 2026: member

IIUC we lose the ability to find transactions in stale blocks (that are not included in the canonical chain). Test that passes before and fails after this PR:

diff --git a/test/functional/rpc_rawtransaction.py b/test/functional/rpc_rawtransaction.py
index 78e12139fc..4e4f85d1f8 100755
--- a/test/functional/rpc_rawtransaction.py
+++ b/test/functional/rpc_rawtransaction.py
@@ -172,4 +172,14 @@ class RawTransactionsTest(BitcoinTestFramework):
             gottx = self.nodes[n].getrawtransaction(txid=tx, verbose=True, blockhash=block1)
             assert_equal(gottx['in_active_chain'], False)
+            if n == 0:
+                self.log.info("Test getrawtransaction with -txindex can find a stale block transaction without blockhash")
+                coinbase_txid = self.nodes[n].getblock(block1)["tx"][0]
+                # Mine another block so txindex processes the reorg and calls CustomRemove()
+                # for the stale block before querying it.
+                self.generate(self.nodes[n], 1, sync_fun=self.no_op)
+                sync_txindex(self, self.nodes[n])
+                raw_tx = self.nodes[n].getrawtransaction(txid=coinbase_txid, verbose=True)
+                assert_equal(raw_tx["txid"], coinbase_txid)
+                assert_equal(raw_tx["blockhash"], block1)
             self.nodes[n].reconsiderblock(block1)
             assert_equal(self.nodes[n].getbestblockhash(), block2)

It's worth pointing that out in the description. It might be fine, but we could preserve this functionality by switching the key to use the block hash upon disconnect, instead of erasing:

<details> <summary>patch</summary>

diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
index e1f42d1fa7..b010161667 100644
--- a/src/index/txindex.cpp
+++ b/src/index/txindex.cpp
@@ -40,4 +40,24 @@ const std::string DB_BEST_BLOCK_V2{"best_block_v2"};
 std::unique_ptr<TxIndex> g_txindex;

+namespace txindex {
+constexpr uint8_t DB_TXINDEX_STALE{'y'};
+
+struct StaleDBKey {
+    TxHashKeyPrefix hash_prefix;
+    uint256 block_hash{};
+    uint32_t tx_offset{0};
+
+    SERIALIZE_METHODS(StaleDBKey, obj)
+    {
+        uint8_t prefix{DB_TXINDEX_STALE};
+        READWRITE(prefix);
+        if (prefix != DB_TXINDEX_STALE) {
+            throw std::ios_base::failure("Invalid format for stale txindex DB key");
+        }
+        READWRITE(obj.hash_prefix, obj.block_hash, obj.tx_offset);
+    }
+};
+} // namespace txindex
+

 /** Access to the txindex database (indexes/txindex/) */
@@ -51,5 +71,5 @@ public:
     bool ReadTxPos(const Txid& txid, CDiskTxPos& pos) const;

-    /// Write or erase a block of transaction positions to the DB.
+    /// Write active entries or move active entries to stale entries for a block.
     void WriteTxs(const interfaces::BlockInfo& block, bool erase = false);

@@ -99,8 +119,9 @@ void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block, bool erase)
     uint32_t tx_offset{GetSizeOfCompactSize(block.data->vtx.size())};
     for (const auto& tx : block.data->vtx) {
-        const txindex::DBKey key{txindex::CreateKeyPrefix(m_hasher, tx->GetHash()),
-                                 txindex::Position{static_cast<uint32_t>(block.height), tx_offset}};
+        const txindex::TxHashKeyPrefix hash_prefix{txindex::CreateKeyPrefix(m_hasher, tx->GetHash())};
+        const txindex::DBKey key{hash_prefix, txindex::Position{static_cast<uint32_t>(block.height), tx_offset}};
         if (erase) {
             batch.Erase(key);
+            batch.Write(txindex::StaleDBKey{hash_prefix, block.hash, tx_offset}, "");
         } else {
             batch.Write(key, "");
@@ -149,15 +170,6 @@ bool TxIndex::FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef&
     txindex::DBKey key{prefix, {}};
     const auto header_offset{static_cast<uint32_t>(GetSerializeSize(CBlockHeader{}))};
-    for (; it->Valid() && it->GetKey(key) && key.hash_prefix == prefix; it->Next()) {
-        FlatFilePos tx_pos;
-        uint256 candidate_block_hash;
-        {
-            LOCK(cs_main);
-            const CBlockIndex* pindex{m_chainstate->m_chain[key.pos.block_height]};
-            if (!pindex) continue;
-            tx_pos = FlatFilePos{pindex->nFile, pindex->nDataPos + header_offset + key.pos.tx_offset};
-            candidate_block_hash = pindex->GetBlockHash();
-        }
-        AutoFile file{m_chainstate->m_blockman.OpenBlockFile(tx_pos, true)};
+    const auto read_tx_at_pos{[&](const FlatFilePos& pos) {
+        AutoFile file{m_chainstate->m_blockman.OpenBlockFile(pos, true)};
         if (file.IsNull()) {
             LogError("OpenBlockFile failed");
@@ -170,4 +182,17 @@ bool TxIndex::FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef&
             return false;
         }
+        return true;
+    }};
+    for (; it->Valid() && it->GetKey(key) && key.hash_prefix == prefix; it->Next()) {
+        FlatFilePos tx_pos;
+        uint256 candidate_block_hash;
+        {
+            LOCK(cs_main);
+            const CBlockIndex* pindex{m_chainstate->m_chain[key.pos.block_height]};
+            if (!pindex) continue;
+            tx_pos = FlatFilePos{pindex->nFile, pindex->nDataPos + header_offset + key.pos.tx_offset};
+            candidate_block_hash = pindex->GetBlockHash();
+        }
+        if (!read_tx_at_pos(tx_pos)) return false;
         if (tx->GetHash() == tx_hash) {
             block_hash = candidate_block_hash;
@@ -176,4 +201,21 @@ bool TxIndex::FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef&
     }

+    it->Seek(std::pair{txindex::DB_TXINDEX_STALE, prefix});
+    txindex::StaleDBKey stale_key{prefix};
+    for (; it->Valid() && it->GetKey(stale_key) && stale_key.hash_prefix == prefix; it->Next()) {
+        FlatFilePos tx_pos;
+        {
+            LOCK(cs_main);
+            const CBlockIndex* pindex{m_chainstate->m_blockman.LookupBlockIndex(stale_key.block_hash)};
+            if (!pindex || !(pindex->nStatus & BLOCK_HAVE_DATA)) continue;
+            tx_pos = FlatFilePos{pindex->nFile, pindex->nDataPos + header_offset + stale_key.tx_offset};
+        }
+        if (!read_tx_at_pos(tx_pos)) return false;
+        if (tx->GetHash() == tx_hash) {
+            block_hash = stale_key.block_hash;
+            return true;
+        }
+    }
+
     // Fallback to legacy if no hashed entry matched.
     CDiskTxPos postx;
diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
index 18913291b8..a16899d631 100644
--- a/src/test/txindex_tests.cpp
+++ b/src/test/txindex_tests.cpp
@@ -180,5 +180,5 @@ BOOST_FIXTURE_TEST_CASE(txindex_collision_scan_path, TestChain100Setup)
 }

-BOOST_FIXTURE_TEST_CASE(txindex_reorg_erases_entries, TestChain100Setup)
+BOOST_FIXTURE_TEST_CASE(txindex_reorg_keeps_stale_entries, TestChain100Setup)
 {
     TxIndex txindex(interfaces::MakeChain(m_node), 1_MiB, true);
@@ -205,4 +205,5 @@ BOOST_FIXTURE_TEST_CASE(txindex_reorg_erases_entries, TestChain100Setup)
     BOOST_REQUIRE(txindex.FindTx(unique_txid, block_hash, tx_disk));
     BOOST_CHECK(tx_disk->GetHash() == unique_txid);
+    const uint256 stale_block_hash{block_hash};

     ChainstateManager& chainman{*m_node.chainman};
@@ -218,6 +219,8 @@ BOOST_FIXTURE_TEST_CASE(txindex_reorg_erases_entries, TestChain100Setup)
     BOOST_REQUIRE(txindex.BlockUntilSyncedToCurrentChain());

-    // The disconnected transaction's entry must have been erased.
-    BOOST_CHECK(!txindex.FindTx(unique_txid, block_hash, tx_disk));
+    // The disconnected transaction must still be found through the stale block entry.
+    BOOST_REQUIRE(txindex.FindTx(unique_txid, block_hash, tx_disk));
+    BOOST_CHECK(tx_disk->GetHash() == unique_txid);
+    BOOST_CHECK(block_hash == stale_block_hash);

     txindex.Stop();

</details>

This is arguably wasteful for transactions that are included in the canonical chain, but we could (later) expand the RPC to be able to find them. Stale blocks are pretty rare on mainnet anyway, so it's not much extra data.
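To make the "move instead of erase" idea concrete, here is a minimal sketch that uses two `std::set`s as stand-ins for the active and stale key ranges in LevelDB. `ToyIndex`, `ActiveKey` and `StaleKey` are hypothetical names for illustration only and are not the types from the patch; the hash prefix is widened to a `uint64_t` for simplicity.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <tuple>

// Hypothetical stand-ins for the on-disk keys (not the PR's types).
// Active entries are located by height, stale entries by block hash.
using ActiveKey = std::tuple<uint64_t, uint32_t, uint32_t>;    // hash prefix, height, tx offset
using StaleKey = std::tuple<uint64_t, std::string, uint32_t>;  // hash prefix, block hash, tx offset

struct ToyIndex {
    std::set<ActiveKey> active;
    std::set<StaleKey> stale;

    // Block connected: record the transaction under its height.
    void Connect(uint64_t prefix, uint32_t height, uint32_t offset)
    {
        active.emplace(prefix, height, offset);
    }

    // Block disconnected: drop the height-based key and keep a hash-based one,
    // so the transaction stays findable after another block takes that height.
    void Disconnect(uint64_t prefix, uint32_t height, const std::string& block_hash, uint32_t offset)
    {
        active.erase(ActiveKey{prefix, height, offset});
        stale.emplace(prefix, block_hash, offset);
    }
};
```

The block hash is needed in the stale key because a height only identifies a block on the active chain; once the block is disconnected, the same height can point at a different block.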

andrewtoth commented at 1:52 PM on June 24, 2026: contributor

we lose the ability to find transactions in stale blocks

@Sjors I updated the PR description to note that breaking change. There is a unit test for this new functionality.

It might be fine, but we could preserve this functionality by switching the key to use the block hash upon disconnect, instead of erasing:

What should we do here? I think (could be wrong) that's not really a feature many users use, and users who are syncing the index from scratch also don't get that feature (since the initial sync won't index non-canonical blocks) so it's not really deterministic anyways. IMO it would be fine to remove that functionality and address it in a release note.

Users can also work around this by just passing the block hash to getrawtransaction, as you note it only breaks if there's no block hash.

Sjors commented at 2:51 PM on June 24, 2026: member

updated the PR description to note that breaking change

Let's also mention it in the release note.

not really a feature many users use

Probably not. I could imagine it's useful for external wallet / lightning software, to know that a given transaction now lives in a stale block. But as long as they kept track of the block hash, they don't need the index.

It might not hurt to (have an agent) scan for projects that use -txindex to see if any rely on this behavior.

so it's not really deterministic anyways

That and the extra complexity seem good reasons to not bother supporting it.

andrewtoth force-pushed on Jun 25, 2026

andrewtoth commented at 2:22 AM on June 25, 2026: contributor

Thanks @Sjors, added release notes.

in src/index/txindex.cpp:67 in 1d5b61c400

  66 |  TxIndex::DB::DB(size_t n_cache_size, bool f_memory, bool f_wipe) :
  67 | -    BaseIndex::DB(gArgs.GetDataDirNet() / "indexes" / "txindex", n_cache_size, f_memory, f_wipe)
  68 | +    BaseIndex::DB(gArgs.GetDataDirNet() / "indexes" / "txindex", n_cache_size, f_memory, f_wipe),
  69 | +    m_hasher{[](CDBWrapper& db) {
  70 | +        std::pair<uint64_t, uint64_t> siphash_key;
  71 | +        if (!db.Read("siphash_key", siphash_key)) {

l0rinc commented at 9:18 PM on June 26, 2026:

1d5b61c txindex: hash keys and pack positions to reduce disk usage:

Could we name siphash_key after its role rather than the implementation detail? The siphash docs call them keys, but for the dbcache we call them salts:

static constexpr std::string DB_TXID_HASH_SALT{"txid_hash_salt"};

in src/index/txindex.cpp:68 in 1d5b61c400 outdated

  67 | -    BaseIndex::DB(gArgs.GetDataDirNet() / "indexes" / "txindex", n_cache_size, f_memory, f_wipe)
  68 | +    BaseIndex::DB(gArgs.GetDataDirNet() / "indexes" / "txindex", n_cache_size, f_memory, f_wipe),
  69 | +    m_hasher{[](CDBWrapper& db) {
  70 | +        std::pair<uint64_t, uint64_t> siphash_key;
  71 | +        if (!db.Read("siphash_key", siphash_key)) {
  72 | +            FastRandomContext rng{};

l0rinc commented at 9:19 PM on June 26, 2026:

1d5b61c txindex: hash keys and pack positions to reduce disk usage:

Do we ever need to make this deterministic?

andrewtoth commented at 9:00 PM on June 30, 2026:

I don't see a reason to.

in src/index/txindex.cpp:106 in 1d5b61c400

 104 | +        const txindex::DBKey key{txindex::CreateKeyPrefix(m_hasher, tx->GetHash()),
 105 | +                                 txindex::Position{static_cast<uint32_t>(block.height), tx_offset}};
 106 | +        if (erase) {
 107 | +            batch.Erase(key);
 108 | +        } else {
 109 | +            batch.Write(key, "");

l0rinc commented at 9:20 PM on June 26, 2026:

1d5b61c txindex: hash keys and pack positions to reduce disk usage:

So basically LevelDB is now storing a sorted set instead of a map. I wonder if we could tune LevelDB better for this case. nit:

            batch.Write(key, ""); // The tx position is encoded in the key, so the value is intentionally empty

Actually, it seems that this results in extra data for the "empty" value, which is actually encoded as a one-element \0, see:

BOOST_AUTO_TEST_CASE(dbwrapper_empty_string_vs_span)
{
    const auto batch_size{[&](const auto& value) {
        CDBWrapper dbw({.path = m_args.GetDataDirBase() / "empty_string_vs_span", .cache_bytes = 1_MiB, .memory_only = true});
        CDBBatch batch(dbw);
        batch.Write(0, value);
        return batch.ApproximateSize();
    }};
    BOOST_CHECK_EQUAL(batch_size(""), batch_size(std::span<std::byte>{})); // Fails with: 20 != 19
}

This would likely save us roughly 1.4 GB of logical payload currently. I wonder if we could fix TxoSpenderIndex as well.
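A minimal sketch of where the extra byte can come from, assuming the serializer writes a character array byte for byte (which is what "one-element \0" suggests). `ToyStream`, `WriteArray` and `WriteRange` are hypothetical helpers, not dbwrapper or serialize.h APIs. The string literal `""` has type `const char[1]`, so writing the array emits its terminating NUL, while an empty byte range emits nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy sink standing in for a serialization stream (hypothetical).
struct ToyStream {
    std::vector<unsigned char> bytes;
    void write(const void* data, std::size_t len)
    {
        if (len == 0) return; // nothing to append for an empty range
        const auto* p = static_cast<const unsigned char*>(data);
        bytes.insert(bytes.end(), p, p + len);
    }
};

// Writes a fixed-size character array byte for byte, terminator included.
template <std::size_t N>
void WriteArray(ToyStream& s, const char (&a)[N])
{
    s.write(a, N);
}

// Writes a (pointer, length) range; an empty range adds no bytes.
void WriteRange(ToyStream& s, const unsigned char* data, std::size_t len)
{
    s.write(data, len);
}

std::size_t SizeOfEmptyLiteral()
{
    ToyStream s;
    WriteArray(s, ""); // N is deduced as 1: the array holds only '\0'
    return s.bytes.size();
}

std::size_t SizeOfEmptyRange()
{
    ToyStream s;
    WriteRange(s, nullptr, 0);
    return s.bytes.size();
}
```

Under that assumption the saving is one byte per index entry, which is consistent with the 20 vs 19 difference reported by the test above.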

andrewtoth commented at 9:00 PM on June 30, 2026:

Tested this - 27 GB synced in 1h23m :rocket:

in src/index/txindex.cpp:150 in 1d5b61c400

 153 |  {
 154 | +    const txindex::TxHashKeyPrefix prefix{txindex::CreateKeyPrefix(m_db->m_hasher, tx_hash)};
 155 | +    std::unique_ptr<CDBIterator> it{m_db->NewIterator()};
 156 | +    it->Seek(std::pair{txindex::DB_TXINDEX_HASHED, prefix});
 157 | +    txindex::DBKey key{prefix, {}};
 158 | +    const auto header_offset{static_cast<uint32_t>(GetSerializeSize(CBlockHeader{}))};

l0rinc commented at 9:24 PM on June 26, 2026:

1d5b61c txindex: hash keys and pack positions to reduce disk usage:

nit: can we use a fixed header size constant here instead, e.g. CBlockHeader::SERIALIZED_SIZE? This should likely be done in a separate PR before this one.

andrewtoth commented at 9:02 PM on June 30, 2026:

I removed the header from the read path. It's computed in the write path at the start of a block instead. The position was stored relative to the end of the header because the header used to be read first, followed by a seek to the tx. Now we don't read the header and can seek right to the tx.

re: the size constant, can be done before or after, it's not a blocker for this PR IMO.

in src/index/txindex.cpp:177 in 1d5b61c400

 180 | +            block_hash = block_index->GetBlockHash();
 181 | +            return true;
 182 | +        }
 183 | +    }
 184 | +
 185 | +    // Fallback to legacy if no hashed entry matched.

l0rinc commented at 9:25 PM on June 26, 2026:

1d5b61c txindex: hash keys and pack positions to reduce disk usage:

A txindex miss now searches the hashed bucket before falling back to the legacy full-txid key. Could we document at the fallback why that extra lookup is intentional? And preferably extract the two independent algorithms to local helpers.

    // Fallback to legacy if no hashed entry matched. This makes misses pay an
    // extra lookup, but keeps existing full-txid entries readable after upgrade.

Q: is it possible to use the strategy pattern (https://en.wikipedia.org/wiki/Strategy_pattern) here to choose one of the three combinations: always use legacy, use new with fallback, always use new? In that case the decision is taken only once, and after that we only ever use one of the searches.

andrewtoth commented at 9:04 PM on June 30, 2026:

Added the comment.

re: strategy - we don't need the always legacy case. That's not something that can happen if we're using a node with an advancing chain.

We can check the db before opening if it contains any legacy keys, and use that to determine if we need to do a fallback. We can also open without bloom filters in that case. I opted to wait and see if #35568 gets merged before doing that though. Can be done safely in a follow-up.

andrewtoth commented at 8:25 PM on July 12, 2026:

Updated to peek into the db to check for any legacy entries. If we don't find any, we skip the fallback lookup. We also disable bloom filters in that case, which resulted in another GB less data (26 GB) and 4 minutes off the sync time :rocket:.
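For readers following along, the "peek" can be pictured as a single seek into the sorted key space. This is a minimal sketch with a `std::set<std::string>` standing in for LevelDB; `KeyStore`, `HasLegacyEntries` and the `'t'` tag for legacy full-txid keys are assumptions made for illustration, not the code in the PR.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy sorted key store standing in for LevelDB (hypothetical).
using KeyStore = std::set<std::string>;

// Assumed one-byte tag in front of legacy full-txid keys.
constexpr char LEGACY_TAG{'t'};

// Seeks to the first key at or after the legacy tag and checks whether that
// key actually carries the tag. One seek answers the question for the whole db.
inline bool HasLegacyEntries(const KeyStore& db)
{
    const auto it = db.lower_bound(std::string(1, LEGACY_TAG));
    return it != db.end() && !it->empty() && it->front() == LEGACY_TAG;
}
```

The result would be computed once when the index is opened and then used to decide whether a miss in the hashed range needs the second, legacy lookup at all.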

in src/index/txindex_key.h:67 in 1d5b61c400

  62 | +
  63 | +using TxHashKeyPrefix = std::array<uint8_t, 5>;
  64 | +
  65 | +inline TxHashKeyPrefix CreateKeyPrefix(const PresaltedSipHasher& hasher, const Txid& txid)
  66 | +{
  67 | +    const uint64_t siphash{hasher(txid.ToUint256())};

l0rinc commented at 9:27 PM on June 26, 2026:

1d5b61c txindex: hash keys and pack positions to reduce disk usage:

Does endianness matter here? Don't we need a htobe64_internal call here? It would be preferable to be able to copy these between architectures. We could copy the normalized version directly into the array:

using TxHashKeyPrefix = std::array<std::byte, 5>;

inline TxHashKeyPrefix CreateKeyPrefix(const PresaltedSipHasher& hasher, const Txid& txid)
{
    const uint64_t hash{htobe64_internal(hasher(txid.ToUint256()))};
    TxHashKeyPrefix prefix;
    std::memcpy(prefix.data(), &hash, prefix.size());
    return prefix;
}
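An alternative sketch that avoids both the byte-swap helper and `memcpy` by extracting the bytes with shifts, which yields the same result on any host byte order. It assumes the prefix should be the five most significant bytes of the 64-bit hash (what the `htobe64_internal` plus `memcpy` version above produces); `PrefixFromHash` is a hypothetical name.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using TxHashKeyPrefix = std::array<uint8_t, 5>;

// Copies the five most significant bytes of `hash`, most significant first.
// Only shifts are used, so the output is identical on little- and big-endian hosts.
inline TxHashKeyPrefix PrefixFromHash(uint64_t hash)
{
    TxHashKeyPrefix prefix{};
    for (std::size_t i = 0; i < prefix.size(); ++i) {
        prefix[i] = static_cast<uint8_t>(hash >> (56 - 8 * i));
    }
    return prefix;
}
```

For example, the hash `0x0102030405060708` maps to the prefix bytes `01 02 03 04 05`, so a database written on one architecture can be read on another.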

in src/index/txindex_key.h:63 in 1d5b61c400

  58 | +        block_height = static_cast<uint32_t>(code / MAX_BLOCK_SERIALIZED_SIZE);
  59 | +        tx_offset = static_cast<uint32_t>(code % MAX_BLOCK_SERIALIZED_SIZE);
  60 | +    }
  61 | +};
  62 | +
  63 | +using TxHashKeyPrefix = std::array<uint8_t, 5>;

l0rinc commented at 9:29 PM on June 26, 2026:

1d5b61c txindex: hash keys and pack positions to reduce disk usage:

Can we use std::byte here instead?

in doc/release-notes-35531.md:4 in 8c5562e876

   0 | @@ -0,0 +1,12 @@
   1 | +## Index
   2 | +
   3 | +- The transaction index (`-txindex`) now stores less data on disk: the previous
   4 | +  index used about 66 GB, while the new index uses about 29 GB. The index is

l0rinc commented at 9:33 PM on June 26, 2026:

8c5562e doc: add release notes:

By the time this gets released the 66 vs 29 will be outdated - how about roughly half the size or similar?

- The transaction index (`-txindex`) now stores less data on disk, roughly
  halving the size of a fully rebuilt index. The index is backwards compatible,
  so existing users will not see the space saving unless the index is
  recreated. To do so, stop the node, delete the

in doc/release-notes-35531.md:6 in 8c5562e876

   0 | @@ -0,0 +1,12 @@
   1 | +## Index
   2 | +
   3 | +- The transaction index (`-txindex`) now stores less data on disk: the previous
   4 | +  index used about 66 GB, while the new index uses about 29 GB. The index is
   5 | +  backwards compatible, so existing users will not see the space saving unless
   6 | +  the index is erased and rebuilt. To do so, stop the node, delete the

l0rinc commented at 9:34 PM on June 26, 2026:

8c5562e doc: add release notes:

erased and rebuilt sounds scary - how about "rebuilt"/"recreated"?

in src/index/txindex_key.h:27 in 8c5562e876

  22 | +//! The location of a transaction: the height of the block that contains it and the
  23 | +//! transaction's byte offset within that block (after the header).
  24 | +//!
  25 | +//! Since the offset must always be less than the max block serialized size, we can
  26 | +//! pack the position into a single integer code = max_block_size * height + offset
  27 | +//! and split apart as (height = code / max_block_size, offset = code % max_block_size).

l0rinc commented at 9:50 PM on June 26, 2026:

Can we add a test to validate the boundary crossings of 1-7 bytes?

BOOST_AUTO_TEST_CASE(txindex_position_width_boundaries)
{
    constexpr std::array<std::pair<txindex::BlockTxPosition, size_t>, 14> boundaries{{
        // block height   tx offset   width
        {{0,              0},         1},
        {{0,              255},       1},
        {{0,              256},       2},
        {{0,              65'535},    2},
        {{0,              65'536},    3},
        {{4,              777'215},   3},
        {{4,              777'216},   4},
        {{1'073,          2'967'295}, 4},
        {{1'073,          2'967'296}, 5},
        {{274'877,        3'627'775}, 5},
        {{274'877,        3'627'776}, 6},
        {{70'368'744,     710'655},   6},
        {{70'368'744,     710'656},   7},
        {{4'294'967'295U, 3'999'999}, 7},
    }};
    for (const auto& [position, expected_width] : boundaries) {
        DataStream stream;
        stream << position;
        BOOST_CHECK_EQUAL(stream.size(), expected_width);

        txindex::BlockTxPosition decoded;
        stream >> decoded;
        BOOST_CHECK_EQUAL(decoded.block_height, position.block_height);
        BOOST_CHECK_EQUAL(decoded.tx_offset_in_block, position.tx_offset_in_block);
    }
}
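The boundary rows in the table follow directly from the packing rule code = max_block_size * height + offset. A minimal sketch of that arithmetic, assuming MAX_BLOCK_SERIALIZED_SIZE is 4,000,000; `ToyPosition`, `Pack`, `Unpack` and `MinWidth` are hypothetical names, not the PR's serializer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed value of MAX_BLOCK_SERIALIZED_SIZE.
constexpr uint64_t kMaxBlockSize{4'000'000};

struct ToyPosition {
    uint32_t block_height;
    uint32_t tx_offset;
};

// Packs height and offset into one integer; offset must be below kMaxBlockSize.
constexpr uint64_t Pack(ToyPosition p)
{
    return kMaxBlockSize * p.block_height + p.tx_offset;
}

// Splits the packed code back into height and offset.
constexpr ToyPosition Unpack(uint64_t code)
{
    return {static_cast<uint32_t>(code / kMaxBlockSize),
            static_cast<uint32_t>(code % kMaxBlockSize)};
}

// Smallest number of bytes that can hold `code` (at least one).
constexpr std::size_t MinWidth(uint64_t code)
{
    std::size_t width{1};
    while (code >>= 8) ++width;
    return width;
}
```

For instance, height 4 with offset 777,215 packs to 16,777,215, the largest value that fits in three bytes, and one more byte of offset pushes the code to four bytes.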

andrewtoth commented at 9:05 PM on June 30, 2026:

Not needed anymore with the static 6 byte suffix.

in src/index/txindex_key.h:48 in 8c5562e876

  43 | +    }
  44 | +
  45 | +    template <typename Stream>
  46 | +    void Unserialize(Stream& s)
  47 | +    {
  48 | +        const size_t width{s.size()};

l0rinc commented at 9:51 PM on June 26, 2026:

Could we avoid architecture-specific types in serialization code?

in src/index/txindex_key.h:86 in 8c5562e876

  81 | +    explicit DBKey(const TxHashKeyPrefix& hash_in, const Position& pos_in) : hash_prefix{hash_in}, pos{pos_in} {}
  82 | +
  83 | +    SERIALIZE_METHODS(DBKey, obj)
  84 | +    {
  85 | +        uint8_t prefix{DB_TXINDEX_HASHED};
  86 | +        READWRITE(prefix);

l0rinc commented at 9:57 PM on June 26, 2026:

Mixing read/write & validation like this seems confusing to me - could the constant prefix be written during serialization and only validated during deserialization?

template <typename Stream>
void Serialize(Stream& s) const
{
    ser_writedata8(s, DB_TXINDEX_HASHED);
    s << hash_prefix << pos;
}

template <typename Stream>
void Unserialize(Stream& s)
{
    if (ser_readdata8(s) != DB_TXINDEX_HASHED) throw std::ios_base::failure("Invalid format for txindex DB key");
    s >> hash_prefix >> pos;
}

l0rinc commented at 2:34 AM on July 17, 2026:

This was reverted in the latest push. The packed suffix makes the persisted format depend on MAX_BLOCK_SERIALIZED_SIZE and couples the block sequence and transaction offset through multiplication and division. The new key types also share one read/write body, which mixes prefix emission with read-time validation.

Given the concerns about the 11-byte layout's complexity, could we keep the 12-byte keys but serialize the sequence and offset as separate three-byte big-endian fields and give each key explicit Serialize and Unserialize paths?

This preserves ordering, supports more than 16 million sequences and every valid transaction offset, and pins the on-disk layout with fixed unit vectors and round-trip fuzz coverage.

<details><summary>simplify txindex position encoding</summary>

diff --git a/src/index/txindex_key.h b/src/index/txindex_key.h
index 75ef9c1dad..4d2d662552 100644
--- a/src/index/txindex_key.h
+++ b/src/index/txindex_key.h
@@ -13,7 +13,6 @@
 #include <uint256.h>
 
 #include <array>
-#include <cassert>
 #include <cstddef>
 #include <cstdint>
 #include <cstring>
@@ -30,32 +29,30 @@ constexpr uint8_t DB_BLOCK_HASH{'h'};
 //! (including the header), so the on-disk position is simply
 //! block_data_pos + tx_offset_in_block.
 //!
-//! Since the offset is always less than the maximum serialized block size, we pack
-//! the position into a single integer code = max_block_size * block_seq + offset, and
-//! split it apart as (block_seq = code / max_block_size, offset = code % max_block_size).
+//! Both values are serialized as three-byte big-endian integers, preserving their
+//! ordering while keeping the position fixed at six bytes.
 struct BlockTxPosition {
     uint32_t block_seq{0};
     uint32_t tx_offset_in_block{0};
 
     friend bool operator==(const BlockTxPosition&, const BlockTxPosition&) = default;
 
-    static constexpr int SERIALIZED_SIZE{6}; // Holds packed positions until block sequence number 70,368,744
+    static constexpr int BLOCK_SEQ_SIZE{3}, TX_OFFSET_SIZE{3};
+    static constexpr int SERIALIZED_SIZE{BLOCK_SEQ_SIZE + TX_OFFSET_SIZE};
+    static_assert(MAX_BLOCK_SERIALIZED_SIZE <= BigEndianFormatter<TX_OFFSET_SIZE>::MAX);
 
     template <typename Stream>
     void Serialize(Stream& s) const
     {
-        assert(tx_offset_in_block < MAX_BLOCK_SERIALIZED_SIZE);
-        const uint64_t code{uint64_t{MAX_BLOCK_SERIALIZED_SIZE} * block_seq + tx_offset_in_block};
-        s << Using<BigEndianFormatter<SERIALIZED_SIZE>>(code);
+        s << Using<BigEndianFormatter<BLOCK_SEQ_SIZE>>(block_seq);
+        s << Using<BigEndianFormatter<TX_OFFSET_SIZE>>(tx_offset_in_block);
     }
 
     template <typename Stream>
     void Unserialize(Stream& s)
     {
-        uint64_t code;
-        s >> Using<BigEndianFormatter<SERIALIZED_SIZE>>(code);
-        block_seq = static_cast<uint32_t>(code / MAX_BLOCK_SERIALIZED_SIZE);
-        tx_offset_in_block = static_cast<uint32_t>(code % MAX_BLOCK_SERIALIZED_SIZE);
+        s >> Using<BigEndianFormatter<BLOCK_SEQ_SIZE>>(block_seq);
+        s >> Using<BigEndianFormatter<TX_OFFSET_SIZE>>(tx_offset_in_block);
     }
 };
 
@@ -63,12 +60,21 @@ struct BlockTxPosition {
 struct BlockSeqKey {
     uint32_t block_seq{0};
 
-    SERIALIZE_METHODS(BlockSeqKey, obj)
+    static constexpr int BLOCK_SEQ_SIZE{4};
+    static_assert(BLOCK_SEQ_SIZE >= BlockTxPosition::BLOCK_SEQ_SIZE);
+
+    template <typename Stream>
+    void Serialize(Stream& s) const
     {
-        uint8_t prefix{DB_BLOCK_SEQ};
-        READWRITE(prefix);
-        if (prefix != DB_BLOCK_SEQ) throw std::ios_base::failure("Invalid format for txindex block seq key");
-        READWRITE(Using<BigEndianFormatter<4>>(obj.block_seq));
+        ser_writedata8(s, DB_BLOCK_SEQ);
+        s << Using<BigEndianFormatter<BLOCK_SEQ_SIZE>>(block_seq);
+    }
+
+    template <typename Stream>
+    void Unserialize(Stream& s)
+    {
+        if (ser_readdata8(s) != DB_BLOCK_SEQ) throw std::ios_base::failure("Invalid format for txindex block seq key");
+        s >> Using<BigEndianFormatter<BLOCK_SEQ_SIZE>>(block_seq);
     }
 };
 
@@ -76,12 +82,18 @@ struct BlockSeqKey {
 struct BlockHashKey {
     uint256 block_hash;
 
-    SERIALIZE_METHODS(BlockHashKey, obj)
+    template <typename Stream>
+    void Serialize(Stream& s) const
+    {
+        ser_writedata8(s, DB_BLOCK_HASH);
+        s << block_hash;
+    }
+
+    template <typename Stream>
+    void Unserialize(Stream& s)
     {
-        uint8_t prefix{DB_BLOCK_HASH};
-        READWRITE(prefix);
-        if (prefix != DB_BLOCK_HASH) throw std::ios_base::failure("Invalid format for txindex block hash key");
-        READWRITE(obj.block_hash);
+        if (ser_readdata8(s) != DB_BLOCK_HASH) throw std::ios_base::failure("Invalid format for txindex block hash key");
+        s >> block_hash;
     }
 };
 
@@ -102,12 +114,18 @@ struct DBKey {
 
     explicit DBKey(const TxHashKeyPrefix& hash_in, const BlockTxPosition& pos_in) : hash_prefix{hash_in}, pos{pos_in} {}
 
-    SERIALIZE_METHODS(DBKey, obj)
+    template <typename Stream>
+    void Serialize(Stream& s) const
+    {
+        ser_writedata8(s, DB_TXINDEX_HASHED);
+        s << hash_prefix << pos;
+    }
+
+    template <typename Stream>
+    void Unserialize(Stream& s)
     {
-        uint8_t prefix{DB_TXINDEX_HASHED};
-        READWRITE(prefix);
-        if (prefix != DB_TXINDEX_HASHED) throw std::ios_base::failure("Invalid format for txindex DB key");
-        READWRITE(obj.hash_prefix, obj.pos);
+        if (ser_readdata8(s) != DB_TXINDEX_HASHED) throw std::ios_base::failure("Invalid format for txindex DB key");
+        s >> hash_prefix >> pos;
     }
 };
 } // namespace txindex
diff --git a/src/test/fuzz/CMakeLists.txt b/src/test/fuzz/CMakeLists.txt
index 29ef7f0457..b521ff9d04 100644
--- a/src/test/fuzz/CMakeLists.txt
+++ b/src/test/fuzz/CMakeLists.txt
@@ -131,6 +131,7 @@ add_executable(fuzz
   tx_in.cpp
   tx_out.cpp
   tx_pool.cpp
+  txindex.cpp
   txgraph.cpp
   txorphan.cpp
   txrequest.cpp
diff --git a/src/test/fuzz/txindex.cpp b/src/test/fuzz/txindex.cpp
new file mode 100644
index 0000000000..d5f96430cf
--- /dev/null
+++ b/src/test/fuzz/txindex.cpp
@@ -0,0 +1,41 @@
+// Copyright (c) The Bitcoin Core developers
+// Distributed under the MIT software license, see the accompanying
+// file COPYING or https://opensource.org/license/mit/.
+
+#include <consensus/consensus.h>
+#include <index/txindex_key.h>
+#include <serialize.h>
+#include <streams.h>
+#include <test/fuzz/FuzzedDataProvider.h>
+#include <test/fuzz/fuzz.h>
+
+#include <algorithm>
+#include <cassert>
+#include <cstddef>
+#include <cstdint>
+#include <vector>
+
+FUZZ_TARGET(txindex_position)
+{
+    FuzzedDataProvider provider{buffer.data(), buffer.size()};
+    constexpr uint32_t max_block_seq{BigEndianFormatter<txindex::BlockTxPosition::BLOCK_SEQ_SIZE>::MAX};
+
+    const auto bytes{provider.ConsumeBytes<std::byte>(txindex::BlockTxPosition::SERIALIZED_SIZE)};
+    if (bytes.size() == txindex::BlockTxPosition::SERIALIZED_SIZE) {
+        txindex::BlockTxPosition position;
+        assert((DataStream{bytes} >> position).empty());
+        assert(std::ranges::equal(bytes, DataStream{} << position));
+    }
+
+    const txindex::BlockTxPosition position{
+        provider.ConsumeIntegralInRange<uint32_t>(0, max_block_seq),
+        provider.ConsumeIntegralInRange<uint32_t>(0, MAX_BLOCK_SERIALIZED_SIZE - 1),
+    };
+    DataStream encoded;
+    encoded << position;
+    assert(encoded.size() == txindex::BlockTxPosition::SERIALIZED_SIZE);
+
+    txindex::BlockTxPosition decoded;
+    assert((encoded >> decoded).empty());
+    assert(decoded == position);
+}
diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
index 8ec88170cc..32dc92e36a 100644
--- a/src/test/txindex_tests.cpp
+++ b/src/test/txindex_tests.cpp
@@ -22,10 +22,12 @@
 #include <sync.h>
 #include <test/util/setup_common.h>
 #include <util/byte_units.h>
+#include <util/strencodings.h>
 #include <validation.h>
 
 #include <cstdint>
 #include <string>
+#include <string_view>
 #include <utility>
 #include <vector>
 
@@ -70,6 +72,24 @@ FlatFilePos BlockFilePos(const ChainstateManager& chainman, uint32_t height)
 
 } // namespace
 
+BOOST_AUTO_TEST_CASE(txindex_position_encoding)
+{
+    constexpr struct { txindex::BlockTxPosition position; std::string_view encoded; } test_vectors[]{
+        {{0, 0}, "000000000000"},
+        {{1, 2}, "000001000002"},
+        {{10'000'000, 123}, "98968000007b"},
+        {{456, 3'999'999}, "0001c83d08ff"},
+    };
+
+    for (const auto& [position, encoded] : test_vectors) {
+        BOOST_CHECK_EQUAL(HexStr(DataStream{} << position), encoded);
+
+        txindex::BlockTxPosition decoded;
+        BOOST_CHECK((DataStream{ParseHex(encoded)} >> decoded).empty());
+        BOOST_CHECK(decoded == position);
+    }
+}
+
 BOOST_FIXTURE_TEST_CASE(txindex_initial_sync, TestChain100Setup)
 {
     TxIndex txindex(interfaces::MakeChain(m_node), 1_MiB, true);

</details>

andrewtoth commented at 1:49 PM on July 17, 2026:

Indeed, if we store sequence and position as 3 bytes each instead of encoding it does make things simpler. We would be limited to ~16 million blocks, which I think is still ok. I think reduction in complexity is worth it here. For reference, testnet3 is past block 5 million.

We also can have a position up to 16 million, while blocks will limit it to 4 million.
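A standalone sketch of the two three-byte big-endian fields, for anyone who wants to check the layout without the serialization framework. `WriteBE24`, `ReadBE24` and `Encode` are hypothetical helpers, not `BigEndianFormatter`; the expected bytes are taken from the test vectors in the patch above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using EncodedPosition = std::array<uint8_t, 6>;

// Writes the low three bytes of `value` at `out`, most significant first.
inline void WriteBE24(uint8_t* out, uint32_t value)
{
    out[0] = static_cast<uint8_t>(value >> 16);
    out[1] = static_cast<uint8_t>(value >> 8);
    out[2] = static_cast<uint8_t>(value);
}

// Reads three big-endian bytes back into an integer.
inline uint32_t ReadBE24(const uint8_t* in)
{
    return (uint32_t{in[0]} << 16) | (uint32_t{in[1]} << 8) | uint32_t{in[2]};
}

// Encodes sequence and offset as two independent three-byte fields.
inline EncodedPosition Encode(uint32_t block_seq, uint32_t tx_offset)
{
    // Each field must fit in three bytes, i.e. be below 2^24 = 16,777,216.
    assert(block_seq < (1u << 24) && tx_offset < (1u << 24));
    EncodedPosition out{};
    WriteBE24(out.data(), block_seq);
    WriteBE24(out.data() + 3, tx_offset);
    return out;
}
```

Because the sequence comes first and both fields are big-endian, comparing the encoded bytes lexicographically gives the same order as comparing (sequence, offset) numerically, which is what lets the iterator walk entries in block order.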

l0rinc commented at 3:50 AM on July 18, 2026:

I like the new serializers, they're still squashed read/write, but it's obvious what's happening now. I'm still missing the tests to pin the format, so matching serializer and deserializer changes don't silently alter the persisted key format.

<details><summary>pin txindex position encoding</summary>

diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
--- a/src/test/txindex_tests.cpp	(revision cd37d44c7df387a03cf5da0b85a784791b05140b)
+++ b/src/test/txindex_tests.cpp	(revision 1afeef4350d8e531c83ed9ebe4700aa77ca1a226)
@@ -8,6 +8,7 @@
 #include <common/args.h>
 #include <consensus/amount.h>
 #include <consensus/validation.h>
+#include <crypto/hex_base.h>
 #include <dbwrapper.h>
 #include <flatfile.h>
 #include <index/disktxpos.h>
@@ -22,10 +23,12 @@
 #include <sync.h>
 #include <test/util/setup_common.h>
 #include <util/byte_units.h>
+#include <util/strencodings.h>
 #include <validation.h>

 #include <cstdint>
 #include <string>
+#include <string_view>
 #include <utility>
 #include <vector>

@@ -70,6 +73,24 @@

 } // namespace

+BOOST_AUTO_TEST_CASE(txindex_position_encoding)
+{
+    constexpr struct { txindex::BlockTxPosition position; std::string_view encoded; } test_vectors[]{
+        {{0, 0}, "000000000000"},
+        {{1, 2}, "000001000002"},
+        {{10'000'000, 123}, "98968000007b"},
+        {{456, 3'999'999}, "0001c83d08ff"},
+    };
+
+    for (const auto& [position, encoded] : test_vectors) {
+        BOOST_CHECK_EQUAL(HexStr(DataStream{} << position), encoded);
+
+        txindex::BlockTxPosition decoded;
+        BOOST_CHECK((DataStream{ParseHex(encoded)} >> decoded).empty());
+        BOOST_CHECK(decoded == position);
+    }
+}
+
 BOOST_FIXTURE_TEST_CASE(txindex_initial_sync, TestChain100Setup)
 {
     TxIndex txindex(interfaces::MakeChain(m_node), 1_MiB, true);

</details>

in src/index/txindex.cpp:98 in 8c5562e876 outdated

  98 | +    batch.Write(DB_BEST_BLOCK_V2, locator);
  99 | +}
 100 | +
 101 | +void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block, bool erase)
 102 |  {
 103 |      CDBBatch batch(*this);

l0rinc commented at 9:59 PM on June 26, 2026:

https://github.com/bitcoin-core/leveldb-subtree/pull/48 would come in handy here

diff --git a/src/dbwrapper.cpp b/src/dbwrapper.cpp
index ffe6f267a6..e23cf2f4f6 100644
--- a/src/dbwrapper.cpp
+++ b/src/dbwrapper.cpp
@@ -175,6 +175,11 @@ void CDBBatch::Clear()
     assert(m_value_scratch.empty());
 }
 
+void CDBBatch::Reserve(size_t size)
+{
+    m_impl_batch->batch.Reserve(size);
+}
+
 void CDBBatch::WriteImpl(std::span<const std::byte> key, DataStream& value)
 {
     leveldb::Slice slKey(CharCast(key.data()), key.size());
diff --git a/src/dbwrapper.h b/src/dbwrapper.h
index 83da6febe7..0fd243cd6e 100644
--- a/src/dbwrapper.h
+++ b/src/dbwrapper.h
@@ -102,6 +102,7 @@ public:
     explicit CDBBatch(const CDBWrapper& _parent);
     ~CDBBatch();
     void Clear();
+    void Reserve(size_t size);
 
     template <typename K, typename V>
     void Write(const K& key, const V& value)
diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
index f1756b9120..0f05e0ab1f 100644
--- a/src/index/txindex.cpp
+++ b/src/index/txindex.cpp
@@ -99,6 +99,8 @@ void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block, bool erase)
 {
     assert(block.data);
     CDBBatch batch(*this);
+    const auto batch_size{batch.ApproximateSize() + block.data->vtx.size() * (1 + 1 + 1 + txindex::TxHashKeyPrefix{}.size() + 6 + 1)}; // tag + key length + db prefix + hash prefix + compact position + empty value length
+    batch.Reserve(batch_size);
     uint32_t tx_offset_in_block{GetSizeOfCompactSize(block.data->vtx.size())};
     for (const auto& tx : block.data->vtx) {
         const txindex::DBKey key{txindex::CreateKeyPrefix(m_hasher, tx->GetHash()),
@@ -110,6 +112,7 @@ void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block, bool erase)
         }
         tx_offset_in_block += ::GetSerializeSize(TX_WITH_WITNESS(*tx));
     }
+    assert(batch.ApproximateSize() <= batch_size); // TODO remove
     WriteBatch(batch);
 }
 
diff --git a/src/leveldb/db/write_batch.cc b/src/leveldb/db/write_batch.cc
index b54313c35e..b2cb2103d8 100644
--- a/src/leveldb/db/write_batch.cc
+++ b/src/leveldb/db/write_batch.cc
@@ -37,6 +37,8 @@ void WriteBatch::Clear() {
   rep_.resize(kHeader);
 }
 
+void WriteBatch::Reserve(size_t size) { rep_.reserve(size); }
+
 size_t WriteBatch::ApproximateSize() const { return rep_.size(); }
 
 Status WriteBatch::Iterate(Handler* handler) const {
diff --git a/src/leveldb/include/leveldb/write_batch.h b/src/leveldb/include/leveldb/write_batch.h
index 94d4115fed..e05287e299 100644
--- a/src/leveldb/include/leveldb/write_batch.h
+++ b/src/leveldb/include/leveldb/write_batch.h
@@ -21,6 +21,7 @@
 #ifndef STORAGE_LEVELDB_INCLUDE_WRITE_BATCH_H_
 #define STORAGE_LEVELDB_INCLUDE_WRITE_BATCH_H_
 
+#include <cstddef>
 #include <string>
 
 #include "leveldb/export.h"
@@ -56,6 +57,9 @@ class LEVELDB_EXPORT WriteBatch {
   // Clear all updates buffered in this batch.
   void Clear();
 
+  // Reserve space for updates buffered in this batch.
+  void Reserve(size_t size);
+
   // The size of the database changes caused by this batch.
   //
   // This number is tied to implementation details, and may change across
diff --git a/src/test/dbwrapper_tests.cpp b/src/test/dbwrapper_tests.cpp
index 185bf491e5..8dd0699a08 100644
--- a/src/test/dbwrapper_tests.cpp
+++ b/src/test/dbwrapper_tests.cpp
@@ -167,6 +167,9 @@ BOOST_AUTO_TEST_CASE(dbwrapper_batch)
 
         uint256 res;
         CDBBatch batch(dbw);
+        const auto empty_batch_size{batch.ApproximateSize()};
+        batch.Reserve(1_MiB);
+        BOOST_CHECK_EQUAL(batch.ApproximateSize(), empty_batch_size);
 
         batch.Write(key, in);
         batch.Write(key2, in2);

andrewtoth commented at 9:05 PM on June 30, 2026:

Can be added safely after the leveldb change is merged.
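
For reference, the per-entry size used in the `Reserve` estimate above can be sketched as plain arithmetic. This is only an illustration: the constant names are hypothetical, the 5-byte hash prefix is taken from the PR description, and the one-byte length fields assume keys and values shorter than 128 bytes.

```cpp
#include <cstddef>

// Hypothetical constants mirroring the comment in the diff above.
constexpr std::size_t kTagBytes{1};         // WriteBatch record tag (put or delete)
constexpr std::size_t kKeyLengthBytes{1};   // length prefix of a short key
constexpr std::size_t kDbPrefixBytes{1};    // the 'x' namespace byte
constexpr std::size_t kHashPrefixBytes{5};  // assumed size of the truncated salted SipHash
constexpr std::size_t kPositionBytes{6};    // packed position suffix
constexpr std::size_t kValueLengthBytes{1}; // length prefix of the empty value

// Upper bound on the bytes one index entry adds to the batch.
constexpr std::size_t BytesPerEntry()
{
    return kTagBytes + kKeyLengthBytes + kDbPrefixBytes + kHashPrefixBytes + kPositionBytes + kValueLengthBytes;
}

// Size to reserve: what the batch already holds plus one entry per transaction.
constexpr std::size_t EstimateBatchSize(std::size_t current_size, std::size_t tx_count)
{
    return current_size + tx_count * BytesPerEntry();
}
```

Under these assumptions each entry costs 15 bytes, so a block with 4,000 transactions would reserve about 60 kB on top of the batch header. Erase records carry no value, so the estimate stays an upper bound in that case.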

in src/index/txindex_key.h:28 in 8c5562e876

  23 | +//! transaction's byte offset within that block (after the header).
  24 | +//!
  25 | +//! Since the offset must always be less than the max block serialized size, we can
  26 | +//! pack the position into a single integer code = max_block_size * height + offset
  27 | +//! and split apart as (height = code / max_block_size, offset = code % max_block_size).
  28 | +struct Position {

l0rinc commented at 10:11 PM on June 26, 2026:

I originally thought tx_offset was the block-file offset; we might want to clarify that it's the serialized byte offset inside the block. We could rename the type to something less general, maybe BlockTxPosition (which would also make clear that tx_offset is a serialized-block offset).

in src/index/txindex.cpp:166 in 8c5562e876 outdated

 169 | +        if (file.IsNull()) {
 170 | +            LogError("OpenBlockFile failed");
 171 | +            return false;
 172 | +        }
 173 | +        try {
 174 | +            file >> TX_WITH_WITNESS(tx);

l0rinc commented at 11:01 PM on June 29, 2026:

Should we mutate tx when the hash doesn't match? A hash-prefix false positive can be left in the output tx when the scan misses the requested txid. Given that https://github.com/bitcoin/bitcoin/blob/7a851180058facf7824903cc46a1948beb944ed5/src/rpc/rawtransaction.cpp#L156 ignores the return value, tx could contain the wrong value after the call.

diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
--- a/src/test/txindex_tests.cpp	(revision 5ab57fe7288f9cbec8437317d8e51496900e97f6)
+++ b/src/test/txindex_tests.cpp	(revision 75a9d1198f25488b64fb02daf437e3fe025fd64c)
@@ -157,6 +157,7 @@
     it->Next();
     BOOST_REQUIRE(it->Valid() && it->GetKey(key) && key.hash_prefix == target_prefix);
     BOOST_CHECK(read_txid(key.pos) == target_txid);
+    const txindex::Position target_pos{key.pos};
 
     CTransactionRef tx_disk;
     uint256 block_hash;
@@ -164,6 +165,12 @@
     BOOST_REQUIRE(tx_disk);
     BOOST_CHECK(tx_disk->GetHash() == target_txid);
 
+    db.Erase(txindex::DBKey{target_prefix, target_pos});
+    CTransactionRef missing_tx;
+    BOOST_CHECK(!txindex.FindTx(target_txid, block_hash, missing_tx));
+    BOOST_CHECK(!missing_tx);
+    db.Write(txindex::DBKey{target_prefix, target_pos}, "");
+
     // Legacy fallback: drop the first coinbase's hashed entry and re-add it under the
     // old 't' + txid schema (a physical CDiskTxPos), then confirm the lookup still
     // finds it via the legacy path.

andrewtoth commented at 1:30 PM on June 30, 2026:

Nice catch! There were no false positives before, so this bug was never triggered.
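
A minimal sketch of the pattern discussed in this thread, with hypothetical mock types standing in for `CTransactionRef` and the on-disk read: each candidate is held in a local and the output parameter is written only after the full txid matches, so a false positive never leaks to the caller.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for CTransaction / CTransactionRef.
struct MockTx {
    std::string txid;
};
using MockTxRef = std::shared_ptr<const MockTx>;

// Scan the candidates that share a hash prefix. The output is published only
// once a candidate's full txid has been verified.
[[nodiscard]] bool FindTxSketch(const std::string& wanted_txid,
                                const std::vector<MockTxRef>& candidates,
                                MockTxRef& tx_out)
{
    for (const MockTxRef& candidate : candidates) {
        if (candidate && candidate->txid == wanted_txid) {
            tx_out = candidate; // the only place the output is written
            return true;
        }
        // False positive: keep scanning and leave tx_out untouched.
    }
    return false;
}
```

With this shape, a caller that ignores the return value still sees an unchanged (for example null) `tx_out` after a miss.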

in src/index/txindex.cpp:144 in 8c5562e876 outdated

 147 |      return true;
 148 |  }
 149 |  
 150 |  BaseIndex::DB& TxIndex::GetDB() const { return *m_db; }
 151 |  
 152 |  bool TxIndex::FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const

l0rinc commented at 11:03 PM on June 29, 2026:

Can we make this [[nodiscard]] to avoid the situation below, and maybe document that "@param[out] tx The transaction itself." is undefined if we return false?

in doc/release-notes-35531.md:10 in 8c5562e876 outdated

   5 | +  backwards compatible, so existing users will not see the space saving unless
   6 | +  the index is erased and rebuilt. To do so, stop the node, delete the
   7 | +  `<datadir>/indexes/txindex` directory, and restart; rebuilding can take up to
   8 | +  a few hours depending on hardware. Once rebuilt, the index can no longer be
   9 | +  read by previous releases, so downgrading will require rebuilding it again.
  10 | +  Additionally, transactions that are only in blocks reorged out of the best

l0rinc commented at 11:07 PM on June 29, 2026:

Not sure it matters, but will this return the same results as before for historical duplicate transaction IDs (i.e. the BIP30 duplicates)?

andrewtoth commented at 8:45 PM on June 30, 2026:

Interesting! For BIP30 txs both this and legacy indexes will return the same tx data. But, the legacy index would return the later block, and this index will return the earlier block. I don't think it really matters though. Is it worth documenting?

l0rinc commented at 8:49 PM on June 30, 2026:

Is it worth documenting?

it's worth a code comment I'd say...

andrewtoth commented at 8:25 PM on July 12, 2026:

Added a code comment about this.
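
The difference described in this thread can be illustrated with two toy containers. This is a sketch under simplifying assumptions (string txids, heights as positions, hypothetical type names), not the index code itself: a schema keyed by txid alone keeps only the last write, while a schema that puts the position in the key keeps both entries and an ordered scan reaches the lowest position first.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <utility>

using Height = uint32_t;

// Legacy-style schema: one entry per txid, so a later write replaces the earlier one.
struct LegacyIndexSketch {
    std::map<std::string, Height> entries;
    void Add(const std::string& txid, Height height) { entries[txid] = height; }
    std::optional<Height> Find(const std::string& txid) const
    {
        const auto it{entries.find(txid)};
        if (it == entries.end()) return std::nullopt;
        return it->second;
    }
};

// Hashed-style schema: the position is part of the key, so duplicates coexist
// and an ordered scan from the prefix reaches the lowest position first.
struct HashedIndexSketch {
    std::set<std::pair<std::string, Height>> entries;
    void Add(const std::string& txid, Height height) { entries.insert({txid, height}); }
    std::optional<Height> Find(const std::string& txid) const
    {
        const auto it{entries.lower_bound(std::pair<std::string, Height>{txid, Height{0}})};
        if (it == entries.end() || it->first != txid) return std::nullopt;
        return it->second;
    }
};
```

Adding the same txid at heights 91722 and 91880 yields 91880 from the legacy sketch and 91722 from the hashed sketch, which matches the earlier-versus-later behavior noted above.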

in src/test/txindex_tests.cpp:152 in 8c5562e876 outdated

 147 | +    BOOST_REQUIRE(it->Valid() && it->GetKey(key) && key.hash_prefix == fake_prefix);
 148 | +    const txindex::Position fake_pos{key.pos};
 149 | +
 150 | +    db.Write(txindex::DBKey{target_prefix, fake_pos}, "");
 151 | +
 152 | +    // The target's bucket now holds the forged false positive first, then the real target.

l0rinc commented at 11:17 PM on June 29, 2026:

c6d3197 tests: cover txindex hash prefix collisions, legacy lookups and erasure:

Could the test assert the encoded positions it already controls instead of reimplementing tx reads? I don't fully understand why we're re-reading, wouldn't this suffice?

// The target's bucket now holds the forged false positive first, then the real target.
it.reset(db.NewIterator());
it->Seek(std::pair{txindex::DB_TXINDEX_HASHED, target_prefix});
BOOST_REQUIRE(it->Valid() && it->GetKey(key) && key.hash_prefix == target_prefix);
BOOST_CHECK(key.pos.block_height == fake_pos.block_height);
BOOST_CHECK(key.pos.tx_offset_in_block == fake_pos.tx_offset_in_block);
it->Next();
BOOST_REQUIRE(it->Valid() && it->GetKey(key) && key.hash_prefix == target_prefix);
const txindex::BlockTxPosition target_pos{key.pos};
BOOST_CHECK(target_pos.block_height != fake_pos.block_height || target_pos.tx_offset_in_block != fake_pos.tx_offset_in_block);

in src/index/txindex.cpp:131 in 8c5562e876

 126 |  bool TxIndex::CustomAppend(const interfaces::BlockInfo& block)
 127 |  {
 128 |      // Exclude genesis block transaction because outputs are not spendable.
 129 |      if (block.height == 0) return true;
 130 |  
 131 |      assert(block.data);

l0rinc commented at 11:20 PM on June 29, 2026:

WriteTxs needs block.data when called from both CustomAppend and CustomRemove, so we could add the assertion inside it instead:

diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
--- a/src/index/txindex.cpp	(revision f7bcaef568c2d803961efc379d01facb7efdfec0)
+++ b/src/index/txindex.cpp	(revision 25de18042050c5f7998265021420c74ab4b84789)
@@ -97,6 +97,7 @@
 
 void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block, bool erase)
 {
+    assert(block.data);
     CDBBatch batch(*this);
     uint32_t tx_offset_in_block{GetSizeOfCompactSize(block.data->vtx.size())};
     for (const auto& tx : block.data->vtx) {
@@ -131,7 +132,6 @@
     // Exclude genesis block transaction because outputs are not spendable.
     if (block.height == 0) return true;
 
-    assert(block.data);
     m_db->WriteTxs(block);
     return true;
 }

in src/index/txindex.cpp:138 in 8c5562e876 outdated

 141 | +    return true;
 142 | +}
 143 | +
 144 | +bool TxIndex::CustomRemove(const interfaces::BlockInfo& block)
 145 | +{
 146 | +    m_db->WriteTxs(block, /*erase=*/true);

l0rinc commented at 11:50 PM on June 29, 2026:

was just wondering what happens if we want to undo genesis, but it seems to be explicitly guarded, so maybe we could document it here (nit, just resolve if you disagree):

    assert(block.height > 0);

in src/index/txindex.cpp:158 in 8c5562e876

 161 | +        CBlockIndex* block_index;
 162 | +        {
 163 | +            LOCK(cs_main);
 164 | +            block_index = m_chainstate->m_chain[key.pos.block_height];
 165 | +            if (!block_index) continue;
 166 | +            tx_pos = FlatFilePos{block_index->nFile, block_index->nDataPos + header_offset + key.pos.tx_offset};

l0rinc commented at 11:53 PM on June 29, 2026:

We're copying tx_pos under the lock, but keep the CBlockIndex* alive after the lock only to read the block hash if the candidate transaction matches.

Could we copy the block hash under the same lock too, so the unlocked file read uses only local values?

FlatFilePos tx_pos;
uint256 candidate_block_hash;
{
    LOCK(cs_main);
    const CBlockIndex* block_index{m_chainstate->m_chain[key.pos.block_height]};
    if (!block_index) continue;
    tx_pos = FlatFilePos{block_index->nFile, block_index->nDataPos + CBlockHeader::SERIALIZED_SIZE + key.pos.tx_offset_in_block};
    candidate_block_hash = block_index->GetBlockHash();
}

...

tx = candidate_tx;
block_hash = candidate_block_hash;

andrewtoth commented at 1:32 PM on June 30, 2026:

CBlockIndex* being returned is immutable. There are certain fields on it that are guarded by cs_main (like nFile and nDataPos), but others can be read without the lock (like GetBlockHash()).

I think keeping the index is the correct pattern here.

l0rinc commented at 6:06 PM on June 30, 2026:

Isn't that the case for tx_pos as well? Any reason for constructing that inside the lock but not candidate_block_hash? Even if that's not the case, it seems simpler to only expose what's strictly needed after the scope ends.

andrewtoth commented at 7:21 PM on June 30, 2026:

tx_pos needs nFile and nDataPos which are guarded by cs_main.
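
The distinction drawn here can be sketched with a plain `std::mutex`. The types and names below are hypothetical stand-ins (not `CBlockIndex` or `cs_main`): fields that may change are copied while the lock is held, while a field that is immutable after construction can be read later without it.

```cpp
#include <cstdint>
#include <mutex>
#include <string>

// Hypothetical stand-in for a block index entry.
struct BlockIndexSketch {
    const std::string block_hash; // immutable after construction: readable without the lock
    int file{0};                  // guarded by g_main_mutex
    uint32_t data_pos{0};         // guarded by g_main_mutex
};

struct FilePosSketch {
    int file;
    uint32_t pos;
};

std::mutex g_main_mutex; // stand-in for the global lock

// Copy the guarded fields into a local value while holding the lock; the
// caller can then read the file without holding it.
FilePosSketch ResolveTxPos(const BlockIndexSketch& index, uint32_t tx_offset_in_block)
{
    std::lock_guard<std::mutex> lock{g_main_mutex};
    return FilePosSketch{index.file, index.data_pos + tx_offset_in_block};
}
```

Whether to also copy the hash under the lock is the style question debated above; the sketch only shows why the position has to be built there.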

in src/index/txindex.cpp:72 in 8c5562e876

  71 | +        if (!db.Read("siphash_key", siphash_key)) {
  72 | +            FastRandomContext rng{};
  73 | +            siphash_key = {rng.rand64(), rng.rand64()};
  74 | +            db.Write("siphash_key", siphash_key, /*fSync=*/true);
  75 | +        }
  76 | +        return PresaltedSipHasher{siphash_key.first, siphash_key.second};

l0rinc commented at 12:06 AM on June 30, 2026:

This looks like ad-hoc deserialization code for PresaltedSipHasher - could we encapsulate that inside the object itself?

diff --git a/src/crypto/siphash.h b/src/crypto/siphash.h
index 2f28473a4f..e1eeeb25c5 100644
--- a/src/crypto/siphash.h
+++ b/src/crypto/siphash.h
@@ -10,10 +10,13 @@
 #include <span>
 
 class uint256;
+class PresaltedSipHasher;
 
 /** Shared SipHash internal state v[0..3], initialized from (k0, k1). */
 class SipHashState
 {
+    friend class PresaltedSipHasher;
+
     static constexpr uint64_t C0{0x736f6d6570736575ULL}, C1{0x646f72616e646f6dULL}, C2{0x6c7967656e657261ULL}, C3{0x7465646279746573ULL};
 
 public:
@@ -48,17 +51,32 @@ public:
  *
  * This class caches the initial SipHash v[0..3] state derived from (k0, k1)
  * and implements a specialized hashing path for uint256 values, with or
- * without an extra 32-bit word. The internal state is immutable, so
- * PresaltedSipHasher instances can be reused for multiple hashes with the
- * same key.
+ * without an extra 32-bit word. The call operators leave the cached state
+ * unchanged, so PresaltedSipHasher instances can be reused for multiple hashes
+ * with the same key.
  */
 class PresaltedSipHasher
 {
-    const SipHashState m_state;
+    SipHashState m_state;
 
 public:
+    PresaltedSipHasher() noexcept : PresaltedSipHasher{0, 0} {}
     explicit PresaltedSipHasher(uint64_t k0, uint64_t k1) noexcept : m_state{k0, k1} {}
 
+    template <typename Stream>
+    void Serialize(Stream& s) const
+    {
+        s << (m_state.v[0] ^ SipHashState::C0) << (m_state.v[1] ^ SipHashState::C1);
+    }
+
+    template <typename Stream>
+    void Unserialize(Stream& s)
+    {
+        uint64_t k0, k1;
+        s >> k0 >> k1;
+        m_state = SipHashState{k0, k1};
+    }
+
     /** Equivalent to CSipHasher(k0, k1).Write(val).Finalize(). */
     uint64_t operator()(const uint256& val) const noexcept;
 
diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
index a71e8046f7..404b57a56a 100644
--- a/src/index/txindex.cpp
+++ b/src/index/txindex.cpp
@@ -21,6 +21,7 @@
 #include <streams.h>
 #include <sync.h>
 #include <uint256.h>
+#include <util/check.h>
 #include <util/fs.h>
 #include <util/log.h>
 #include <validation.h>
@@ -63,13 +64,14 @@ public:
 TxIndex::DB::DB(size_t n_cache_size, bool f_memory, bool f_wipe) :
     BaseIndex::DB(gArgs.GetDataDirNet() / "indexes" / "txindex", n_cache_size, f_memory, f_wipe),
     m_hasher{[](CDBWrapper& db) {
-        std::pair<uint64_t, uint64_t> siphash_key;
-        if (!db.Read("siphash_key", siphash_key)) {
+        PresaltedSipHasher hasher;
+        if (!db.Read(txindex::DB_TXID_HASH_SALT, hasher)) {
+            // The salt only needs to be generated once and persisted.
             FastRandomContext rng{};
-            siphash_key = {rng.rand64(), rng.rand64()};
-            db.Write("siphash_key", siphash_key, /*fSync=*/true);
+            db.Write(txindex::DB_TXID_HASH_SALT, PresaltedSipHasher{rng.rand64(), rng.rand64()}, /*fSync=*/true);
+            Assert(db.Read(txindex::DB_TXID_HASH_SALT, hasher));
         }
-        return PresaltedSipHasher{siphash_key.first, siphash_key.second};
+        return hasher;
     }(*this)}
 {}
 
diff --git a/src/index/txindex_key.h b/src/index/txindex_key.h
index 025638c36a..8228de8075 100644
--- a/src/index/txindex_key.h
+++ b/src/index/txindex_key.h
@@ -15,9 +15,11 @@
 #include <cstddef>
 #include <cstdint>
 #include <ios>
+#include <string>
 
 namespace txindex {
 constexpr uint8_t DB_TXINDEX_HASHED{'x'};
+static const std::string DB_TXID_HASH_SALT{"txid_hash_salt"};
 
 //! The location of a transaction: the height of the block that contains it and the
 //! transaction's byte offset within that block (after the header).
diff --git a/src/test/hash_tests.cpp b/src/test/hash_tests.cpp
index a5059a8fe8..d51fbecf9f 100644
--- a/src/test/hash_tests.cpp
+++ b/src/test/hash_tests.cpp
@@ -5,6 +5,7 @@
 #include <clientversion.h>
 #include <crypto/siphash.h>
 #include <hash.h>
+#include <streams.h>
 #include <test/util/random.h>
 #include <test/util/setup_common.h>
 #include <util/strencodings.h>
@@ -128,7 +129,14 @@ BOOST_AUTO_TEST_CASE(siphash)
     // and the test would be affected by default tx version bumps if not fixed.
     tx.version = 1;
     ss << TX_WITH_WITNESS(tx);
-    BOOST_CHECK_EQUAL(PresaltedSipHasher(1, 2)(ss.GetHash()), 0x79751e980c2a0a35ULL);
+    const uint256 tx_hash{ss.GetHash()};
+    BOOST_CHECK_EQUAL(PresaltedSipHasher(1, 2)(tx_hash), 0x79751e980c2a0a35ULL);
+
+    PresaltedSipHasher roundtrip_hasher;
+    DataStream serialized_hasher;
+    serialized_hasher << PresaltedSipHasher{1, 2};
+    serialized_hasher >> roundtrip_hasher;
+    BOOST_CHECK_EQUAL(roundtrip_hasher(tx_hash), 0x79751e980c2a0a35ULL);
 
     // Check consistency between CSipHasher and PresaltedSipHasher.
     FastRandomContext ctx;
diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
index 7a01325102..c9a65fe7f7 100644
--- a/src/test/txindex_tests.cpp
+++ b/src/test/txindex_tests.cpp
@@ -99,9 +99,8 @@ BOOST_FIXTURE_TEST_CASE(txindex_collision_scan_path, TestChain100Setup)
 
     CDBWrapper& db{TxIndexTest::GetDB(txindex)};
     ChainstateManager& chainman{*m_node.chainman};
-    std::pair<uint64_t, uint64_t> siphash_key;
-    BOOST_REQUIRE(db.Read("siphash_key", siphash_key));
-    const PresaltedSipHasher hasher{siphash_key.first, siphash_key.second};
+    PresaltedSipHasher hasher;
+    BOOST_REQUIRE(db.Read(txindex::DB_TXID_HASH_SALT, hasher));
 
     // Resolve a position to its physical on-disk location via the active chain, the
     // same way TxIndex::FindTx does.

andrewtoth commented at 1:34 PM on June 30, 2026:

This seems like a pretty invasive change to siphasher code. I'm not sure if it's worth the review effort rather than just serializing 2 uint64_ts?

l0rinc commented at 6:09 PM on June 30, 2026:

It simplifies usage and encapsulates the serialization. We can investigate if it's worth doing in a preceding PR instead. I can also accept if you don't think it's a good idea, but I'd like to explore it.

andrewtoth commented at 7:21 PM on June 30, 2026:

Doesn't need to precede it, can be done in a follow-up. Based on your diff it should be compatible with the current serialization.

l0rinc commented at 7:22 PM on June 30, 2026:

Agree, please resolve
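
The compatibility point rests on how SipHash derives its initial state from the key. A minimal sketch, using the standard SipHash initialization constants and hypothetical helper names, shows that the key can be recovered from the cached state by XOR, so serializing the state this way writes the same two 64-bit values as serializing the key pair directly.

```cpp
#include <cstdint>
#include <utility>

// Standard SipHash initialization constants.
constexpr uint64_t C0{0x736f6d6570736575ULL};
constexpr uint64_t C1{0x646f72616e646f6dULL};
constexpr uint64_t C2{0x6c7967656e657261ULL};
constexpr uint64_t C3{0x7465646279746573ULL};

// Hypothetical stand-in for the cached v[0..3] state.
struct SipStateSketch {
    uint64_t v[4];
};

// Initial state derived from the 128-bit key (k0, k1).
constexpr SipStateSketch InitState(uint64_t k0, uint64_t k1)
{
    return SipStateSketch{{C0 ^ k0, C1 ^ k1, C2 ^ k0, C3 ^ k1}};
}

// XOR is its own inverse, so the key falls out of v[0] and v[1].
constexpr std::pair<uint64_t, uint64_t> RecoverKey(const SipStateSketch& state)
{
    return {state.v[0] ^ C0, state.v[1] ^ C1};
}
```

This only holds for a state that has not absorbed any input yet, which is the case for a freshly constructed presalted hasher.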

in src/index/txindex.cpp:151 in 8c5562e876 outdated

 154 | +    const txindex::TxHashKeyPrefix prefix{txindex::CreateKeyPrefix(m_db->m_hasher, tx_hash)};
 155 | +    std::unique_ptr<CDBIterator> it{m_db->NewIterator()};
 156 | +    it->Seek(std::pair{txindex::DB_TXINDEX_HASHED, prefix});
 157 | +    txindex::DBKey key{prefix, {}};
 158 | +    const auto header_offset{static_cast<uint32_t>(GetSerializeSize(CBlockHeader{}))};
 159 | +    for (; it->Valid() && it->GetKey(key) && key.hash_prefix == prefix; it->Next()) {

l0rinc commented at 1:41 AM on June 30, 2026:

It seems to me a malformed new key would just fall through to the legacy read instead of failing:

diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
--- a/src/test/txindex_tests.cpp	(revision 0948a9dfac92cfbe14bee0f6e79ca02975454737)
+++ b/src/test/txindex_tests.cpp	(date 1782783654685)
@@ -21,6 +21,7 @@
 #include <sync.h>
 #include <test/util/setup_common.h>
 #include <util/byte_units.h>
+#include <util/check.h>
 #include <validation.h>
 
 #include <array>
@@ -187,6 +188,12 @@
     BOOST_CHECK(!missing_tx);
     db.Write(txindex::DBKey{target_prefix, target_pos}, "");
 
+    db.Write(std::make_pair(uint8_t{'t'}, target_txid.ToUint256()), *Assert(resolve_pos(target_pos))); // Legacy fallback entry.
+    db.Write(std::pair{txindex::DB_TXINDEX_HASHED, target_prefix}, ""); // Malformed hashed key without position suffix.
+    tx_disk.reset();
+    BOOST_CHECK(!txindex.FindTx(target_txid, block_hash, tx_disk));
+    BOOST_CHECK(!tx_disk);
+
     // Legacy fallback: drop the first coinbase's hashed entry and re-add it under the
     // old 't' + txid schema (a physical CDiskTxPos), then confirm the lookup still
     // finds it via the legacy path.

andrewtoth commented at 7:24 PM on June 30, 2026:

The legacy read would then return false though, so should be fine?

l0rinc commented at 8:52 PM on June 30, 2026:

I don't have the change checked out locally, but regardless of the return value we should likely catch invalid content early (if we can sanity-check cheaply).

andrewtoth commented at 9:16 PM on June 30, 2026:

it->GetKey returns false if the key fails to deserialize, so this keeps the current behavior: a malformed key just makes FindTx return false. I think that's fine. Malformed keys shouldn't be written to the db.

l0rinc commented at 9:21 PM on June 30, 2026:

Malformed keys shouldn't be written to the db.

Sure, but they can still be read as malformed (e.g. from worn-out SD cards).

andrewtoth commented at 4:42 PM on July 4, 2026:

This seems like an issue with CDBIterator::GetKey. All iterators in the codebase return false for a malformed key. This would require a general fix touching all consumers. I think it is out of scope for this PR.

in src/index/txindex_key.h:38 in 8c5562e876

  33 | +    void Serialize(Stream& s) const
  34 | +    {
  35 | +        assert(tx_offset < MAX_BLOCK_SERIALIZED_SIZE);
  36 | +        const uint64_t code{uint64_t{MAX_BLOCK_SERIALIZED_SIZE} * block_height + tx_offset};
  37 | +        size_t width{1};
  38 | +        for (uint64_t v{code >> 8}; v != 0; v >>= 8) ++width;

l0rinc commented at 2:08 AM on June 30, 2026:

1d5b61c txindex: hash keys and pack positions to reduce disk usage:

We should be able to use an intrinsic for this. std::bit_width would do the loop, but since you start from 1, we could just always set the lowest bit and do something like:

        const auto width{CeilDiv(unsigned(std::bit_width(code | 1)), 8u)};

It still bothers me that we're doing all of this manually when 6 bytes would already cover heights 274,878 to 70,368,744, so we could simply use a fixed BigEndianFormatter<6> for reading and writing:

static constexpr int SERIALIZED_SIZE{6}; // Holds packed positions until height 70,368,744

template <typename Stream>
void Serialize(Stream& s) const
{
    assert(tx_offset_in_block < MAX_BLOCK_SERIALIZED_SIZE);
    const uint64_t code{uint64_t{MAX_BLOCK_SERIALIZED_SIZE} * block_height + tx_offset_in_block};
    s << Using<BigEndianFormatter<SERIALIZED_SIZE>>(code);
}

template <typename Stream>
void Unserialize(Stream& s)
{
    uint64_t code;
    s >> Using<BigEndianFormatter<SERIALIZED_SIZE>>(code);
    block_height = uint32_t(code / MAX_BLOCK_SERIALIZED_SIZE);
    tx_offset_in_block = uint32_t(code % MAX_BLOCK_SERIALIZED_SIZE);
}
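
The height ranges quoted above can be checked with a small standalone sketch. It assumes a maximum serialized block size of 4,000,000 bytes and uses hypothetical helper names; the width loop is written out for portability and gives the same result as `(std::bit_width(code | 1) + 7) / 8` in C++20.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint64_t MAX_BLOCK_SIZE{4'000'000}; // assumed value of MAX_BLOCK_SERIALIZED_SIZE
constexpr std::size_t PACKED_SIZE{6};

// Pack (height, offset) into one integer; the offset must stay below the block size limit.
inline uint64_t Pack(uint32_t height, uint32_t offset)
{
    assert(offset < MAX_BLOCK_SIZE);
    return MAX_BLOCK_SIZE * height + offset;
}

// Bytes a minimal variable-width encoding would need (at least one).
inline unsigned MinimalWidth(uint64_t code)
{
    unsigned width{1};
    for (uint64_t v{code >> 8}; v != 0; v >>= 8) ++width;
    return width;
}

// Fixed-width big-endian encoding: most significant byte first.
inline std::array<uint8_t, PACKED_SIZE> EncodeBE(uint64_t code)
{
    assert((code >> (8 * PACKED_SIZE)) == 0);
    std::array<uint8_t, PACKED_SIZE> out{};
    for (std::size_t i{0}; i < PACKED_SIZE; ++i) {
        out[PACKED_SIZE - 1 - i] = static_cast<uint8_t>(code >> (8 * i));
    }
    return out;
}

inline uint64_t DecodeBE(const std::array<uint8_t, PACKED_SIZE>& in)
{
    uint64_t code{0};
    for (const uint8_t byte : in) code = (code << 8) | byte;
    return code;
}
```

Under this assumption, `Pack(274878, 0)` is the first block-start code that needs 6 bytes, and every position up to height 70,368,743 fits in 6 bytes (height 70,368,744 fits only for small offsets). Big-endian byte order also means a bytewise key comparison orders entries by height, then offset.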

in src/index/txindex.cpp:123 in 8c5562e876

 115 | @@ -71,27 +116,65 @@ TxIndex::TxIndex(std::unique_ptr<interfaces::Chain> chain, size_t n_cache_size,
 116 |  
 117 |  TxIndex::~TxIndex() = default;
 118 |  
 119 | +interfaces::Chain::NotifyOptions TxIndex::CustomOptions()
 120 | +{
 121 | +    interfaces::Chain::NotifyOptions options;
 122 | +    options.disconnect_data = true;
 123 | +    return options;

l0rinc commented at 5:16 AM on June 30, 2026:

nit:

    return {.disconnect_data = true};

l0rinc changes_requested

l0rinc commented at 6:37 AM on June 30, 2026: contributor

I like the overall approach and the disk-space reduction, but I’m not ready to ACK yet.

My main concerns are:

batch.Write(key, "") appears to serialize a one-byte \0 value instead of a truly empty value, which is costly at txindex scale (we should apply the fix to TxoSpenderIndex as well).
FindTx can mutate the output tx with a hash-prefix false positive even when it returns false, and at least one caller ignores the return value.
The position encoding is variable-width and hand-rolled, while a fixed 6-byte BigEndianFormatter seems simpler and covers block heights up to ~70M.
Malformed hashed keys should fail closed instead of falling through to legacy lookup.

The rest of my comments are mostly simplifications, naming, documentation, and test coverage around the new persisted format.

andrewtoth force-pushed on Jun 30, 2026

sedited referenced this in commit b393985aa0 on Jul 4, 2026

willcl-ark added the label UTXO Db and Indexes on Jul 9, 2026

willcl-ark added the label Resource usage on Jul 9, 2026

DrahtBot added the label Needs rebase on Jul 12, 2026

andrewtoth force-pushed on Jul 12, 2026

andrewtoth commented at 8:27 PM on July 12, 2026: contributor

Rebased due to #35568. We disable bloom filters now too (since we only use iterators for reads now and not point lookups) to get us down to 26 GB.

Addressed all of @l0rinc's suggestions (thanks!).

DrahtBot added the label CI failed on Jul 12, 2026

andrewtoth force-pushed on Jul 12, 2026

DrahtBot removed the label Needs rebase on Jul 12, 2026

DrahtBot removed the label CI failed on Jul 12, 2026

l0rinc commented at 1:40 AM on July 13, 2026: contributor

Very impressive results here, too! Will review in more detail soon.

One quick question in the meantime:

Some collisions will occur, but the penalty is just an extra disk read, deserialization and hash.

This assumes the blocks are available locally, but if we extend this later to a proxy that fetches the blocks on demand for pruned nodes, we probably want to adjust the prefix count accordingly, either as a bigger constant, or by having a different prefix size for the two cases where the duplicate check price is radically different (full archival reading from disk vs pruned fetching block from network).

<details><summary>x86_64 Ryzen 7 3700X, SSD: 66G → 26G, 1.60x faster indexing</summary>

BEFORE="907e284e303ae57c9f983f0709586df0377b33d6"; AFTER="d65ea3305ed52fdfdf7b235018ae1a819d52a2de"; DATA_DIR="/mnt/my_storage/BitcoinData"; export DATA_DIR; LOG="${DATA_DIR}/debug.log"; export LOG; SIZE_LOG="${PWD}/txindex-size.log"; export SIZE_LOG; : > "${SIZE_LOG}"; wait_index() { pid="$1"; while kill -0 "$pid" 2>/dev/null; do grep -q 'txindex is enabled at height' "${LOG}" 2>/dev/null && return 0; sleep 5; done; grep -q 'txindex is enabled at height' "${LOG}" 2>/dev/null; }; export -f wait_index; run_index() { label="$1"; daemon="$2"; shift 2; cli="${daemon%bitcoind}bitcoin-cli"; "$daemon" "$@" & pid=$!; if ! wait_index "$pid"; then wait "$pid" || true; echo "$label: bitcoind exited before txindex was enabled" >&2; tail -100 "${LOG}" >&2 || true; return 1; fi; "$cli" -datadir="${DATA_DIR}" stop >/dev/null || kill "$pid" 2>/dev/null || true; wait "$pid" || true; sleep 10; du -sh "${DATA_DIR}/indexes/txindex" | awk -v label="$label" '{print label " size: " $1}' >> "${SIZE_LOG}"; du -sb "${DATA_DIR}/indexes/txindex" | awk -v label="$label" '{print label " bytes: " $1}' >> "${SIZE_LOG}"; }; export -f run_index; git reset --hard >/dev/null 2>&1 && git clean -fxd >/dev/null 2>&1 && git fetch origin $BEFORE $AFTER >/dev/null 2>&1; for c in $BEFORE:build-before $AFTER:build-after; do   git checkout ${c%:*} >/dev/null 2>&1 && cmake -B ${c#*:} -G Ninja -DCMAKE_BUILD_TYPE=Release >/dev/null 2>&1 && ninja -C ${c#*:} bitcoind bitcoin-cli >/dev/null 2>&1; done; echo "txindex | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) threads | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $DATA_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $DATA_DIR | tail -1) | grep -q 1 && echo HDD || echo SSD)" && hyperfine --runs 1 --shell bash --sort command   --prepare "rm -rf ${DATA_DIR}/indexes/* ${DATA_DIR}/debug.log"   "run_index before ./build-before/bin/bitcoind -datadir=${DATA_DIR} -txindex=1 -connect=0 -printtoconsole=0"   "run_index after  ./build-after/bin/bitcoind  -datadir=${DATA_DIR} -txindex=1 -connect=0 -printtoconsole=0"; echo "on-disk sizes:" && cat "${SIZE_LOG}"
txindex | ssd-ryzen | x86_64 | AMD Ryzen 7 3700X 8-Core Processor | 16 threads | 62Gi RAM | ext4 | SSD
Benchmark 1: run_index before ./build-before/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -txindex=1 -connect=0 -printtoconsole=0
  Time (abs ≡):        6625.822 s               [User: 7809.061 s, System: 1725.243 s]

Benchmark 2: run_index after  ./build-after/bin/bitcoind  -datadir=/mnt/my_storage/BitcoinData -txindex=1 -connect=0 -printtoconsole=0
  Time (abs ≡):        4138.617 s               [User: 6683.912 s, System: 769.002 s]

Relative speed comparison
        1.60          run_index before ./build-before/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -txindex=1 -connect=0 -printtoconsole=0
        1.00          run_index after  ./build-after/bin/bitcoind  -datadir=/mnt/my_storage/BitcoinData -txindex=1 -connect=0 -printtoconsole=0

on-disk sizes:
before size: 66G
before bytes: 70846285271
after size: 26G
after bytes: 27266296199

</details>

in src/index/txindex.cpp:174 in e467357cf2

 178 | +        FlatFilePos tx_pos;
 179 | +        const CBlockIndex* block_index;
 180 | +        {
 181 | +            LOCK(cs_main);
 182 | +            block_index = m_chainstate->m_chain[key.pos.block_height];
 183 | +            if (!block_index) continue;

andrewtoth commented at 2:44 AM on July 13, 2026:

@l0rinc Threading here for comment #35531 (comment).

This assumes the blocks are available locally, but if we extend this later to a proxy that fetches the blocks on demand for pruned nodes, we probably want to adjust the prefix count accordingly, either as a bigger constant, or by having a different prefix size for the two cases where the duplicate check price is radically different (full archival reading from disk vs pruned fetching block from network).

Alternatively, if we are looping through the collisions and we come across a pruned entry, we could finish the loop before requesting the pruned block be fetched. That way false positive collisions would be ignored for that extreme case. Of course if both (or more) are pruned, we would still have to fetch all of them.
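
The two-pass idea can be sketched as follows, with hypothetical types and a counter standing in for network requests: candidates whose blocks are on disk are checked first, and pruned blocks are fetched only if none of those match.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical candidate sharing the wanted txid's hash prefix.
struct CandidateSketch {
    std::string txid; // txid found at the candidate position once the block is read
    bool pruned;      // true if the block would have to be fetched from a peer
};

// Returns the index of the matching candidate and counts simulated network fetches.
std::optional<std::size_t> FindWithDeferredFetch(const std::vector<CandidateSketch>& candidates,
                                                 const std::string& wanted_txid,
                                                 std::size_t& fetches)
{
    fetches = 0;
    // Pass 1: check every candidate whose block is available on disk.
    for (std::size_t i{0}; i < candidates.size(); ++i) {
        if (!candidates[i].pruned && candidates[i].txid == wanted_txid) return i;
    }
    // Pass 2: only now pay for fetching pruned blocks, one candidate at a time.
    for (std::size_t i{0}; i < candidates.size(); ++i) {
        if (!candidates[i].pruned) continue;
        ++fetches;
        if (candidates[i].txid == wanted_txid) return i;
    }
    return std::nullopt;
}
```

A pruned false positive costs nothing when the real match is on disk, while the all-pruned case still fetches each colliding block until the match is found.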

ajtowns commented at 11:00 PM on July 13, 2026: contributor

Two thoughts:

We could store a single number, 4000000*block_height + tx_offset.

Could we index blocks by download order (4M*block_seq + tx_offset) instead? Would probably require a block_seq to file/pos index to lookup (and maybe a hash->block_seq index as well?), but that doesn't seem terrible, and would make the txindex be reorg-independent.

Also, if we're doing a full reset of the txindex anyway, could we consider adding wtxids into the index as well? The net logs generally refers to txs by their wtxid these days, so looking up the tx details after the fact can be annoying. Having (txid, 0) be the key when txid=wtxid, and using two keys (txid, 1) and (wtxid, 2) when they're different could work okay if you want to allow people to query by just txid, just wtxid or either.
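
One way to read the key scheme proposed in the last paragraph is sketched below. The enum and function names are hypothetical, and ids are plain strings for brevity: a transaction without witness data gets a single key tagged 0, and any other transaction gets one key per id, tagged 1 for the txid and 2 for the wtxid.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Tag stored next to the id in the key (values follow the proposal above).
enum class IdKind : uint8_t { BOTH = 0, TXID_ONLY = 1, WTXID_ONLY = 2 };
using IndexKey = std::pair<std::string, IdKind>;

// Keys to write for one transaction.
std::vector<IndexKey> KeysForTx(const std::string& txid, const std::string& wtxid)
{
    std::vector<IndexKey> keys;
    if (txid == wtxid) {
        keys.emplace_back(txid, IdKind::BOTH);
    } else {
        keys.emplace_back(txid, IdKind::TXID_ONLY);
        keys.emplace_back(wtxid, IdKind::WTXID_ONLY);
    }
    return keys;
}

enum class Query { BY_TXID, BY_WTXID, EITHER };

// Whether an entry with the given tag may answer the given kind of query.
bool Matches(IdKind kind, Query query)
{
    switch (query) {
    case Query::BY_TXID: return kind != IdKind::WTXID_ONLY;
    case Query::BY_WTXID: return kind != IdKind::TXID_ONLY;
    case Query::EITHER: return true;
    }
    return false;
}
```

The sketch also makes the cost visible: every transaction with witness data contributes two entries instead of one.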

in src/index/txindex_key.h:31 in e467357cf2

  26 | +//! transaction's serialized byte offset from the start of that block (including the
  27 | +//! header), so the on-disk position is simply block_data_pos + tx_offset_in_block.
  28 | +//!
  29 | +//! Since the offset is always less than the maximum serialized block size, we pack
  30 | +//! the position into a single integer code = max_block_size * height + offset, and
  31 | +//! split it apart as (height = code / max_block_size, offset = code % max_block_size).

andrewtoth commented at 11:56 PM on July 13, 2026:

@ajtowns Threading here for your comment #35531 (comment).

Could we index blocks by download order (4M*block_seq + tx_offset) instead? Would probably require a block_seq to file/pos index to lookup (and maybe a hash->block_seq index as well?), but that doesn't seem terrible, and would make the txindex be reorg-independent.

One of the advantages of indexing by block height in the best chain is that we could allow running txindex while pruned. We could request the block from peers JIT if we don't have it (see #35531 (review)). That would be a major UX win IMO. That does require deleting reorged entries. May I ask what you think the benefit of keeping that feature is? If the user has the stale block hash they can still query by txid.

Also, if we're doing a full reset of the txindex anyway, could we consider adding wtxids into the index as well? The net logs generally refer to txs by their wtxid these days, so looking up the tx details after the fact can be annoying. Having (txid, 0) be the key when txid=wtxid, and using two keys (txid, 1) and (wtxid, 2) when they're different could work okay if you want to allow people to query by just txid, just wtxid or either.

This would ~double the size of the index, correct? The main goal of this change is to reduce the on disk footprint. I would opt to keep the current txid only index. This also keeps the index backwards compatible, so users don't need to do a full reset. But, I think it would be fairly easy after the fact to add a wtxid config option that would also index wtxids?

ajtowns commented at 12:33 AM on July 14, 2026:

One of the advantages of indexing by block height in the best chain is that we could allow running txindex while pruned. We could request the block from peers JIT if we don't have it (see #35531 (comment)). That would be a major UX win IMO. That does require deleting reorged entries.

I think it doesn't require deleting reorged entries -- you'd need a block_seq -> block_hash, file, pos index, so the process for a tx in a reorged block would be "lookup txid, get block_seq. lookup block_seq, notice file is pruned. lookup block_hash, notice it's not in the active chain, error". You might want to delete entries as you prune the corresponding blocks anyway (lookup the block being pruned, notice it's not in the active chain, lookup each txid in that block, delete the entry) just to save space, of course?
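A minimal sketch of that lookup flow, assuming toy in-memory maps in place of the LevelDB tables and the block index. `BlockRecord`, `LookupOutcome`, and `Classify` are hypothetical names for illustration, not Bitcoin Core APIs, and a 64-bit integer stands in for the txid.

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-in for what block_seq -> block_hash plus the in-memory
// block index would tell us about a block.
struct BlockRecord {
    bool data_on_disk;     // false once the block file has been pruned
    bool in_active_chain;  // false for a reorged-out (stale) block
};

enum class LookupOutcome { Found, UnknownTx, PrunedActive, PrunedStale };

// Walk txid -> block_seq -> block record and classify the result.
LookupOutcome Classify(const std::map<uint64_t, uint32_t>& tx_to_seq,
                       const std::map<uint32_t, BlockRecord>& seq_to_block,
                       uint64_t txid)
{
    const auto tx_it = tx_to_seq.find(txid);
    if (tx_it == tx_to_seq.end()) return LookupOutcome::UnknownTx;

    const auto block_it = seq_to_block.find(tx_it->second);
    if (block_it == seq_to_block.end()) return LookupOutcome::UnknownTx;

    const BlockRecord& block = block_it->second;
    if (block.data_on_disk) return LookupOutcome::Found;
    // Pruned: only an active-chain block could plausibly be re-requested;
    // a stale one ends in an error, as described above.
    return block.in_active_chain ? LookupOutcome::PrunedActive
                                 : LookupOutcome::PrunedStale;
}
```

In this model nothing is deleted on reorg: a stale entry simply resolves to `PrunedStale` once its block file is gone.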

I'm a bit skeptical about txindex with pruning, fwiw; an extra 25GB is a fair bit on very small machines, and automatic JIT requesting of blocks from random peers is a little bit concerning for privacy reasons? I suspect I wouldn't turn that feature on. Maybe if the txindex were limited to txs in non-pruned blocks, that might be interesting to me though.

May I ask what you think the benefit of keeping that feature is? If the user has the stale block hash they can still query by txid.

I don't have a specific use in mind; just that it avoids changing the RPC / being a breaking change, and keeps it easy to lookup txs. In general: if you've got a database to make it easy to look things up, it should be easy to look everything up, there shouldn't be special exceptions that are going to end up being confusing and making life difficult at the worst possible time.

[wtxid indexing] This would ~double the size of the index, correct?

Probably a fair bit less (lots of historical txs have txid=wtxid obviously), but right order of magnitude. I don't mind it being configurable. At least for me, more functionality at roughly the same size would be much better than reducing the size.

so users don't need to do a full reset.

If it's only 2h to regenerate the txindex from scratch, and that saves 30GB or gets you wtxids indexed as well, seems pretty worth doing to me. I suppose it would be even nicer if you could continue to use the old txindex while generating the new one, though.

andrewtoth commented at 1:38 AM on July 14, 2026:

an extra 25GB is a fair bit on very small machines

It doesn't need to be a small machine - today a txindex node cannot be run on a 1TB drive. If your laptop has a 2TB drive, running with txindex will take up half your drive. If you just want to run txindex you don't need all those blocks.

automatic JIT requesting of blocks from random peers is a little bit concerning for privacy reasons

Is this any different than syncing with compact block filters? The peer will not know which txid you are interested in, or even if you are requesting it for a txid lookup vs just calling getblockfrompeer.

l0rinc commented at 10:42 PM on July 14, 2026:

ajtowns commented at 11:26 PM on July 14, 2026:

Is this any different than syncing with compact block filters?

If you've got a light client requesting blocks from random peers, I think it's pretty similar. Defenses for compact block filters are only connecting to your own trusted full node, adding a bunch of false positive requests to disguise what you're actually interested in, and only connecting to tor nodes to help avoid you being identified. I don't think those are applicable here.

The places where we use that is in rescan/importdescriptors/restorewallet, but that's all going back to our own node, not pulling from the network, so isn't a privacy risk, aiui.

I think "people who call getblockfrompeer" is an extremely small anonymity set, that's not meaningfully relevant.

What's the benefit to having bitcoin-cli getrawtransaction XXXX work on a pruned node that doesn't have that transaction, versus just querying mempool.space?

I think it doesn't require deleting reorged entries -- you'd need a block_seq -> block_hash, file, pos index, ...

Oh, obviously we already have a block_hash -> file, pos index; so all we'd actually need is the block_seq -> block_hash mapping.

l0rinc commented at 11:59 PM on July 14, 2026:

adding a bunch of false positive requests to disguise what you're actually interested in

We could lower the prefix size, thereby increasing the collision rate and broadening the anonymity set :P

What's the benefit to having bitcoin-cli getrawtransaction XXXX work on a pruned node that doesn't have that transaction, versus just querying mempool.space?

For a single manual lookup, the main benefit is not having to disclose the txid to a third-party service. The larger benefit is for local wallet operations: with the header chain and a local compact-filter index, a pruned node could identify blocks matching wallet-known scripts during rescan/import/restore and fetch those missing blocks.

Maybe if the txindex were limited to txs in non-pruned blocks, that might be interesting to me though

Yes, I imagined adding the pruned block proxy as a disabled-by-default feature, with a small buffer of recently requested blocks kept in memory (or persisted to disk, as @andrewtoth suggested). We should have that fully prototyped before we merge this, both to be sure it's possible and to understand the tradeoffs.

This would ~double the size of the index, correct?

According to the AI overlords, a reasonable estimate is about 1.6× as many entries.

Could we index blocks by download order (4M*block_seq + tx_offset) instead?

I think we can reduce the key from 12 bytes (a one-byte tag, a five-byte hash prefix, and a six-byte position) to 11 by letting the hash prefix and position share a byte. With four tag values, this would not increase the hash-prefix collision rate; alternatively, we could use two tag values and accept twice as many collision candidates.

First, we don't need to support every potential block height forever. Using 21 bits supports heights through 2,097,151, which is enough for roughly another 21 years (I love the number 21).

We can also store the transaction offset in 21 bits by dividing it by two. This doesn't assume that some byte offsets are impossible; it folds each adjacent pair of offsets into one value. Valid transaction starts are at least 60 serialized bytes apart, so at most one member of the pair can be a transaction start. A lookup checks 2 * stored_offset and 2 * stored_offset + 1 and verifies the full txid (we already have a similar candidate-scanning loop for hash-prefix collisions).

That gives us a 42-bit position: a 21-bit block height plus a 21-bit offset divided by two. Equivalently, we can divide the current packed position by two (stored_position = (4,000,000 * height + offset) / 2).

To retain the current effective 40-bit hash prefix, we could use four database tag values selected by two bits of the salted hash. The remaining 38 hash bits and the 42-bit position then occupy ten bytes (2 tag bits + 38 inline bits + 42 position bits).

The requested txid hash suffix tells us exactly which tag to seek, so this still uses one LevelDB seek and the existing collision-scanning path. It reduces every key from 12 to 11 bytes without changing the current hash-prefix collision rate, at the cost of checking at most two adjacent transaction positions.

If four tag values feel too awkward, two tags can carry one hash bit instead (1 tag bit + 38 inline bits = 39 effective hash bits). That would double the hash-prefix collision rate compared with the current 40-bit prefix—roughly one extra candidate per 400 lookups instead of one per 800—but otherwise keeps the same 11-byte key and two-position lookup. So the tradeoff is essentially four tag values versus twice as many prefix collisions.
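The four-tag variant could be sketched roughly as follows. This is an illustration under stated assumptions, not code from the PR: `PackedKey`, `UnpackedKey`, `Pack`, and `Unpack` are hypothetical names, the byte order is assumed big-endian, and the position is built by bit concatenation (`height << 21 | offset / 2`) rather than the multiply-based form mentioned above.

```cpp
#include <array>
#include <cstdint>

// Hypothetical layout: 1 tag byte carrying 2 hash bits, then 10 bytes holding
// 38 hash bits followed by a 42-bit position (21-bit height, 21-bit offset/2).
struct PackedKey {
    uint8_t tag;                   // 0..3, the top 2 bits of the 40-bit hash
    std::array<uint8_t, 10> body;  // big-endian: 38 hash bits, 42 position bits
};

struct UnpackedKey {
    uint64_t hash40;
    uint32_t height;
    uint32_t offset_even;  // the tx starts at offset_even or offset_even + 1
};

constexpr uint64_t MASK_21{(1ULL << 21) - 1};
constexpr uint64_t MASK_38{(1ULL << 38) - 1};
constexpr uint64_t MASK_42{(1ULL << 42) - 1};

PackedKey Pack(uint64_t hash40, uint32_t height, uint32_t offset)
{
    const uint64_t inline38{hash40 & MASK_38};
    const uint64_t pos42{((height & MASK_21) << 21) | ((offset >> 1) & MASK_21)};
    // Split the 80-bit value inline38:pos42 into a 16-bit high part and a
    // 64-bit low part (unsigned overflow on the shift is intentional).
    const uint64_t hi16{inline38 >> 22};
    const uint64_t lo64{(inline38 << 42) | pos42};

    PackedKey key{};
    key.tag = static_cast<uint8_t>((hash40 >> 38) & 0x3);
    key.body[0] = static_cast<uint8_t>(hi16 >> 8);
    key.body[1] = static_cast<uint8_t>(hi16);
    for (int i = 0; i < 8; ++i) {
        key.body[2 + i] = static_cast<uint8_t>(lo64 >> (8 * (7 - i)));
    }
    return key;
}

UnpackedKey Unpack(const PackedKey& key)
{
    const uint64_t hi16{(uint64_t{key.body[0]} << 8) | key.body[1]};
    uint64_t lo64{0};
    for (int i = 0; i < 8; ++i) lo64 = (lo64 << 8) | key.body[2 + i];

    const uint64_t inline38{(hi16 << 22) | (lo64 >> 42)};
    const uint64_t pos42{lo64 & MASK_42};

    UnpackedKey out{};
    out.hash40 = (uint64_t{key.tag} << 38) | inline38;
    out.height = static_cast<uint32_t>(pos42 >> 21);
    out.offset_even = static_cast<uint32_t>((pos42 & MASK_21) << 1);
    return out;
}
```

The low bit of the offset is lost on purpose, so a reader would try both `offset_even` and `offset_even + 1` and keep whichever deserializes to the requested txid.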

andrewtoth commented at 4:00 AM on July 15, 2026:

What's the benefit to having bitcoin-cli getrawtransaction XXXX work on a pruned node that doesn't have that transaction, versus just querying mempool.space?

The UX is superior for consumers. For instance, electrs can run with a pruned node, and would not have to special case a request to mempool.space if "tx is in a pruned block" error is returned.

ajtowns commented at 9:16 AM on July 15, 2026:

I don't follow -- electrs is documented as requiring a non-pruned node and doesn't need txindex as it creates its own indexes. Do you mean a (hypothetical?) electrs-like thing, that uses bitcoind's indexes directly? Is there a tracking issue/blog post/something describing what you're trying to work towards?

The comparison I was drawing in that context would be "just use mempool.space, don't install electrs at all" rather than "have electrs use mempool.space". The point of running electrs yourself is that you're not giving someone else hints as to which transactions/addresses are interesting to you; if you're reaching out to the network all the time in practice, that seems to be missing about 90% of the point to me. AFAICS if you want to keep that sort of info private but can't afford the storage to keep the full blockchain, you'd be much better off using a wallet db to preserve the txs you're interested in locally, with perhaps a one-off "download more blocks than just the ones I'm interested in" procedure when setting up a new wallet that has pre-existing txs.

I could see having a pruned node that covers the last 5 years of txs (~430GB vs ~800GB; plus indexes and utxo set) and a txindex that only covers the non-pruned txs being both private and useful (if your audit window is <5 years), but just having a wallet db seems better.

andrewtoth commented at 1:22 PM on July 15, 2026:

Yes, I meant a hypothetical electrs-like thing that could be built in the future. You're right, I should document the vision I have for this in a tracking issue and we can further brainstorm there. I think this doesn't need to be decided on in this PR, since your suggestion of using a sequence -> block hash index would still allow us to explore pruning in the future. If downloading blocks is controversial, we can just return a "tx is in a pruned block" error like we do for pruned blocks in getblock.

andrewtoth commented at 1:50 PM on July 15, 2026:

I think we can reduce the key from 12 bytes (a one-byte tag, a five-byte hash prefix, and a six-byte position) to 11 by letting the hash prefix and position share a byte.

@l0rinc this is really cool, but I think it might not be worth the trade-off in complexity. Having to check for so many different false positives might be fairly tricky to review. Also, the ~2.1 million block limit seems a little shallow, especially if we use sequence instead of height. A lot of reorgs could get us to that height sooner, and on some testnets this might already not be enough.

andrewtoth commented at 12:58 AM on July 16, 2026:

@ajtowns I took your suggestion of indexing by sequence instead of height. This removes the breaking change. I did need the hash -> seq index as well, since we need to check whether a block has been indexed already and skip it. Otherwise a block reorged out then reorged back would get duplicate entries added, since it will have a different sequence number than previously.
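The role of the hash -> seq mapping can be illustrated with a toy model. This is a sketch only: `SeqIndex` and its members are hypothetical names, `std::map` stands in for the LevelDB tables, and short strings stand in for block hashes.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Toy model of the two mappings; the real index stores them in LevelDB.
class SeqIndex
{
public:
    // Returns true if the block was newly indexed, false if it was already
    // known (for example, reorged out earlier and now reconnected).
    bool Connect(const std::string& block_hash)
    {
        if (m_hash_to_seq.count(block_hash) > 0) return false;  // skip duplicates
        m_hash_to_seq.emplace(block_hash, m_next_seq);
        m_seq_to_hash.emplace(m_next_seq, block_hash);
        ++m_next_seq;
        return true;
    }

    uint32_t NextSeq() const { return m_next_seq; }
    uint32_t SeqOf(const std::string& block_hash) const { return m_hash_to_seq.at(block_hash); }

private:
    uint32_t m_next_seq{0};
    std::map<std::string, uint32_t> m_hash_to_seq;
    std::map<uint32_t, std::string> m_seq_to_hash;
};
```

Without the early return, reconnecting a block would assign it a fresh sequence number and write a second set of rows for the same transactions.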

<details><summary>I did not do the wtxid index. I think we could do that in a follow-up. It could look something like the following, but we would still need to add the config option, add unit and functional tests, and make sure the db can't have existing entries without wtxids.</summary>

diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
index e3500f5a54..709d1386ce 100644
--- a/src/index/txindex.cpp
+++ b/src/index/txindex.cpp
@@ -34,6 +34,7 @@
 #include <exception>
 #include <string>
 #include <utility>
+#include <variant>
 #include <vector>
 
 constexpr uint8_t DB_TXINDEX{'t'};
@@ -55,7 +56,7 @@ public:
     bool ReadTxPos(const Txid& txid, CDiskTxPos& pos) const;
 
     /// Write a block of transaction positions to the DB.
-    void WriteTxs(const interfaces::BlockInfo& block);
+    void WriteTxs(const interfaces::BlockInfo& block, bool index_wtxids);
 
     /// Used to hash the txid to compute the prefix.
     const PresaltedSipHasher m_hasher;
@@ -112,7 +113,7 @@ void TxIndex::DB::WriteBestBlock(CDBBatch& batch, const CBlockLocator& locator)
     batch.Write(DB_BEST_BLOCK_V2, locator);
 }
 
-void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block)
+void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block, bool index_wtxids)
 {
     if (Exists(txindex::BlockHashKey{block.hash})) return;
 
@@ -125,18 +126,20 @@ void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block)
     batch.Write(DB_NEXT_BLOCK_SEQ, block_seq + 1);
     uint32_t tx_offset_in_block{static_cast<uint32_t>(GetSerializeSize(CBlockHeader{})) + GetSizeOfCompactSize(block.data->vtx.size())};
     for (const auto& tx : block.data->vtx) {
-        const txindex::DBKey key{txindex::CreateKeyPrefix(m_hasher, tx->GetHash()),
-                                 txindex::BlockTxPosition{block_seq, tx_offset_in_block}};
+        const txindex::BlockTxPosition pos{block_seq, tx_offset_in_block};
         // The tx position is encoded in the key, so the value is intentionally
         // empty. A 0-length byte array avoids the spurious '\0' that "" would store.
-        batch.Write(key, std::array<std::byte, 0>{});
+        batch.Write(txindex::DBKey{txindex::CreateKeyPrefix(m_hasher, tx->GetHash()), pos}, std::array<std::byte, 0>{});
+        if (index_wtxids && tx->HasWitness()) {
+            batch.Write(txindex::DBKey{txindex::CreateKeyPrefix(m_hasher, tx->GetWitnessHash()), pos}, std::array<std::byte, 0>{});
+        }
         tx_offset_in_block += ::GetSerializeSize(TX_WITH_WITNESS(*tx));
     }
     WriteBatch(batch);
 }
 
-TxIndex::TxIndex(std::unique_ptr<interfaces::Chain> chain, size_t n_cache_size, bool f_memory, bool f_wipe)
-    : BaseIndex(std::move(chain), "txindex", "txidx"), m_db(std::make_unique<TxIndex::DB>(n_cache_size, f_memory, f_wipe))
+TxIndex::TxIndex(std::unique_ptr<interfaces::Chain> chain, size_t n_cache_size, bool f_memory, bool f_wipe, bool index_wtxids)
+    : BaseIndex(std::move(chain), "txindex", "txidx"), m_db(std::make_unique<TxIndex::DB>(n_cache_size, f_memory, f_wipe)), m_index_wtxids{index_wtxids}
 {}
 
 TxIndex::~TxIndex() = default;
@@ -147,15 +150,15 @@ bool TxIndex::CustomAppend(const interfaces::BlockInfo& block)
     if (block.height == 0) return true;
 
     assert(block.data);
-    m_db->WriteTxs(block);
+    m_db->WriteTxs(block, m_index_wtxids);
     return true;
 }
 
 BaseIndex::DB& TxIndex::GetDB() const { return *m_db; }
 
-bool TxIndex::FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const
+bool TxIndex::FindTx(const GenTxid& gtxid, uint256& block_hash, CTransactionRef& tx) const
 {
-    const txindex::TxHashKeyPrefix prefix{txindex::CreateKeyPrefix(m_db->m_hasher, tx_hash)};
+    const txindex::TxHashKeyPrefix prefix{std::visit([&](const auto& id) { return txindex::CreateKeyPrefix(m_db->m_hasher, id); }, gtxid)};
     std::unique_ptr<CDBIterator> it{m_db->NewIterator()};
     it->Seek(std::pair{txindex::DB_TXINDEX_HASHED, prefix});
     txindex::DBKey key{prefix, {}};
@@ -206,7 +209,8 @@ bool TxIndex::FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef&
             LogError("Deserialize or I/O error - %s", e.what());
             return false;
         }
-        if (tx->GetHash() == tx_hash) {
+        const uint256& txid{gtxid.IsWtxid() ? tx->GetWitnessHash().ToUint256() : tx->GetHash().ToUint256()};
+        if (txid == gtxid.ToUint256()) {
             block_hash = candidate.block_index->GetBlockHash();
             return true;
         }
@@ -214,6 +218,7 @@ bool TxIndex::FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef&
 
     tx.reset();
     if (!m_db->m_has_legacy) return false;
+    const Txid tx_hash{*std::get_if<Txid>(&gtxid)};
     // Fall back to legacy if no hashed entry matched. This makes misses pay an
     // extra lookup, but keeps existing full-txid entries readable after upgrade.
     CDiskTxPos postx;
diff --git a/src/index/txindex.h b/src/index/txindex.h
index fff82342d9..7cdaae81d7 100644
--- a/src/index/txindex.h
+++ b/src/index/txindex.h
@@ -34,6 +34,7 @@ protected:
 private:
     friend class txindex_tests::TxIndexTest;
     const std::unique_ptr<DB> m_db;
+    const bool m_index_wtxids;
 
     bool AllowPrune() const override { return false; }
 
@@ -44,18 +45,18 @@ protected:
 
 public:
     /// Constructs the index, which becomes available to be queried.
-    explicit TxIndex(std::unique_ptr<interfaces::Chain> chain, size_t n_cache_size, bool f_memory = false, bool f_wipe = false);
+    explicit TxIndex(std::unique_ptr<interfaces::Chain> chain, size_t n_cache_size, bool f_memory = false, bool f_wipe = false, bool index_wtxids = false);
 
     // Destructor is declared because this class contains a unique_ptr to an incomplete type.
     virtual ~TxIndex() override;
 
     /// Look up a transaction by hash.
     ///
-    /// @param[in]   tx_hash  The hash of the transaction to be returned.
+    /// @param[in]   gtxid  The txid or wtxid of the transaction to be returned.
     /// @param[out]  block_hash  The hash of the block the transaction is found in. Undefined if false is returned.
     /// @param[out]  tx  The transaction itself. Undefined if false is returned.
     /// @return  true if transaction is found, false otherwise
-    [[nodiscard]] bool FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const;
+    [[nodiscard]] bool FindTx(const GenTxid& gtxid, uint256& block_hash, CTransactionRef& tx) const;
 };
 
 /// The global transaction index, used in GetTransaction. May be null.
diff --git a/src/index/txindex_key.h b/src/index/txindex_key.h
index 75ef9c1dad..3b33ae7364 100644
--- a/src/index/txindex_key.h
+++ b/src/index/txindex_key.h
@@ -87,6 +87,7 @@ struct BlockHashKey {
 
 using TxHashKeyPrefix = std::array<std::byte, 5>;
 
+template <TxidOrWtxid Txid>
 inline TxHashKeyPrefix CreateKeyPrefix(const PresaltedSipHasher& hasher, const Txid& txid)
 {
     std::array<std::byte, sizeof(uint64_t)> be_hash;

</details>

l0rinc commented at 3:16 AM on July 17, 2026:

The new sequence mappings preserve stale transaction lookups and avoid duplicate rows when a known block reconnects, but neither transition seems to be covered by tests.

Could we disconnect and reconnect an indexed block and verify that its transaction keeps one stable position throughout?

<details><summary>test sequence reorgs</summary>

BOOST_FIXTURE_TEST_CASE(txindex_sequence_reorg, TestChain100Setup)
{
    TxIndex txindex(interfaces::MakeChain(m_node), /*n_cache_size=*/1_MiB, /*f_memory=*/true);
    BOOST_REQUIRE(txindex.Init());
    txindex.Sync();

    const CScript& coinbase_script{m_coinbase_txns[0]->vout[0].scriptPubKey};
    const CBlock tx_block{CreateAndProcessBlock({}, coinbase_script)};
    const Txid txid{tx_block.vtx[0]->GetHash()};
    const uint256 tx_block_hash{tx_block.GetHash()};
    CreateAndProcessBlock({}, coinbase_script);
    BOOST_REQUIRE(txindex.BlockUntilSyncedToCurrentChain());

    CDBWrapper& db{TxIndexTest::GetDB(txindex)};
    const auto prefix{txindex::CreateKeyPrefix(ReadHasher(db), txid)};
    const auto original_bucket{BucketPositions(db, prefix)};
    BOOST_REQUIRE_EQUAL(original_bucket.size(), 1U);

    const auto check_lookup{[&] {
        CTransactionRef tx;
        uint256 block_hash;
        BOOST_REQUIRE(txindex.FindTx(txid, block_hash, tx));
        BOOST_CHECK(block_hash == tx_block_hash);
    }};
    check_lookup();

    ChainstateManager& chainman{*m_node.chainman};
    Chainstate& chainstate{chainman.ActiveChainstate()};
    CBlockIndex* tx_block_index{WITH_LOCK(cs_main, return chainman.m_blockman.LookupBlockIndex(tx_block_hash))};
    BOOST_REQUIRE(tx_block_index);

    BlockValidationState state;
    BOOST_REQUIRE(chainstate.InvalidateBlock(state, tx_block_index));
    // A distinct coinbase prevents recreating the invalidated block. The
    // two-block original branch stays heavier and reconnects below.
    const CBlock replacement_block{CreateAndProcessBlock({}, CScript() << OP_TRUE)};
    BOOST_REQUIRE(WITH_LOCK(cs_main, return chainman.ActiveChain().Tip()->GetBlockHash() == replacement_block.GetHash()));
    BOOST_REQUIRE(txindex.BlockUntilSyncedToCurrentChain());

    check_lookup();
    BOOST_CHECK(!WITH_LOCK(cs_main, return chainman.ActiveChain().Contains(*tx_block_index)));

    {
        LOCK(cs_main);
        chainstate.ResetBlockFailureFlags(tx_block_index);
        chainman.RecalculateBestHeader();
    }
    BOOST_REQUIRE(chainstate.ActivateBestChain(state));
    BOOST_REQUIRE(WITH_LOCK(cs_main, return chainman.ActiveChain().Contains(*tx_block_index)));
    BOOST_REQUIRE(txindex.BlockUntilSyncedToCurrentChain());

    BOOST_CHECK(BucketPositions(db, prefix) == original_bucket);
    check_lookup();

    txindex.Stop();
}

</details>

l0rinc referenced this in commit 5d1e68994f on Jul 15, 2026

andrewtoth force-pushed on Jul 15, 2026

andrewtoth force-pushed on Jul 16, 2026

DrahtBot added the label CI failed on Jul 16, 2026

DrahtBot commented at 12:42 AM on July 16, 2026: contributor

🚧 At least one of the CI tasks failed. Task iwyu: https://github.com/bitcoin/bitcoin/actions/runs/29459397665/job/87499526508 LLM reason (✨ experimental): CI failed because IWYU reported/auto-fixed an include issue (non-empty git diff) in src/index/txindex.h, triggering “Failure generated from IWYU”.

<details><summary>Hints</summary>

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

- Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
- A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
- An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

</details>

DrahtBot removed the label CI failed on Jul 16, 2026

in src/index/txindex.cpp:109 in f5aa6b4393 outdated

 107 | +
 108 |      CDBBatch batch(*this);
 109 | -    for (const auto& [txid, pos] : v_pos) {
 110 | -        batch.Write(std::make_pair(DB_TXINDEX, txid.ToUint256()), pos);
 111 | +    batch.Write(txindex::BlockHashKey{block.hash}, block_seq);
 112 | +    batch.Write(txindex::BlockSeqKey{block_seq}, block.hash);

l0rinc commented at 2:39 AM on July 17, 2026:

f5aa6b4 txindex: hash key prefixes and pack block positions:

TxIndex still says it records filesystem locations, while the new rows store a connected-block sequence plus a serialized-block offset.

<details><summary>document logical tx positions</summary>

diff --git a/src/index/txindex.h b/src/index/txindex.h
--- a/src/index/txindex.h	(revision 30d4787950a74f124bb6ab40d4a752edba5df5e2)
+++ b/src/index/txindex.h	(revision 02443da7abdbf63cf2d30e861fe2132dca08acd1)
@@ -23,8 +23,8 @@
 
 /**
  * TxIndex is used to look up transactions included in the blockchain by hash.
- * The index is written to a LevelDB database and records the filesystem
- * location of each transaction by transaction hash.
+ * The index is written to a LevelDB database and records the block sequence
+ * number and serialized block offset of each transaction by transaction hash.
  */
 class TxIndex final : public BaseIndex
 {

</details>

l0rinc changes_requested

l0rinc commented at 3:54 AM on July 17, 2026: contributor

Approach looks good overall, but before merge I’d like sequence reorg coverage, simpler and explicitly tested position and key serialization, and the PR description updated to consistently describe the new sequence-based layout rather than the old height-based encoding.

andrewtoth force-pushed on Jul 17, 2026

in src/test/txindex_tests.cpp:288 in c4c15e115c

 283 | +    BOOST_CHECK(tx_disk->GetHash() == unique_txid);
 284 | +    BOOST_CHECK(block_hash == stale_block_hash);
 285 | +
 286 | +    CDBWrapper& db{TxIndexTest::GetDB(txindex)};
 287 | +    const auto bucket{BucketPositions(db, txindex::CreateKeyPrefix(ReadHasher(db), unique_txid))};
 288 | +    BOOST_CHECK_EQUAL(bucket.size(), 1U);

l0rinc commented at 3:52 AM on July 18, 2026:

c4c15e1 tests: cover txindex hash prefix collisions and legacy fallback:

nit: could we capture the initial bucket and compare it after reconnection?

<details><summary>verify stable reorg positions</summary>

diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
--- a/src/test/txindex_tests.cpp	(revision 1afeef4350d8e531c83ed9ebe4700aa77ca1a226)
+++ b/src/test/txindex_tests.cpp	(revision 1e3664615f8915ee6841a741420e5b510d88f4e1)
@@ -259,6 +259,11 @@
     BOOST_CHECK(tx_disk->GetHash() == unique_txid);
     BOOST_CHECK(block_hash == stale_block_hash);
 
+    CDBWrapper& db{TxIndexTest::GetDB(txindex)};
+    const auto prefix{txindex::CreateKeyPrefix(ReadHasher(db), unique_txid)};
+    const auto original_bucket{BucketPositions(db, prefix)};
+    BOOST_REQUIRE_EQUAL(original_bucket.size(), 1U);
+
     ChainstateManager& chainman{*m_node.chainman};
 
     // Invalidate the block holding the unique transaction, then mine a longer branch.
@@ -299,14 +304,12 @@
     BOOST_CHECK(WITH_LOCK(cs_main, return chainman.ActiveChain().Tip()->GetBlockHash()) == stale_block_hash);
 
     // The transaction is found in the reconnected (again active) block, and its
-    // bucket still holds a single entry.
+    // bucket keeps the original position.
     BOOST_REQUIRE(txindex.FindTx(unique_txid, block_hash, tx_disk));
     BOOST_CHECK(tx_disk->GetHash() == unique_txid);
     BOOST_CHECK(block_hash == stale_block_hash);
 
-    CDBWrapper& db{TxIndexTest::GetDB(txindex)};
-    const auto bucket{BucketPositions(db, txindex::CreateKeyPrefix(ReadHasher(db), unique_txid))};
-    BOOST_CHECK_EQUAL(bucket.size(), 1U);
+    BOOST_CHECK(BucketPositions(db, prefix) == original_bucket);
 
     txindex.Stop();
 }

</details>

l0rinc approved

l0rinc commented at 3:52 AM on July 18, 2026: contributor

LGTM, will review in more detail next week

andrewtoth force-pushed on Jul 19, 2026

andrewtoth commented at 11:21 PM on July 19, 2026: contributor

Thanks for all the reviews. Addressed all @l0rinc's suggestions. The latest version uses sequence numbers instead of height (thanks @ajtowns), so it still indexes txs in stale blocks (cc @Sjors). It also simplifies the key by using 3 bytes each for the block sequence and the tx offset in the block, instead of the previous packed position suggested by @sipa.
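For readers following along, a 6-byte position suffix built from two 3-byte fields might look like the sketch below. `EncodePosition` and `DecodePosition` are hypothetical helpers, and big-endian byte order is an assumption made here for illustration; the PR's actual serialization may differ.

```cpp
#include <array>
#include <cstdint>
#include <optional>
#include <utility>

using PosSuffix = std::array<uint8_t, 6>;
constexpr uint32_t MAX_U24{0xFFFFFF};  // 16,777,215: largest value that fits in 3 bytes

// Encode (block_seq, tx_offset) as two big-endian 3-byte integers.
// Returns nullopt if either value does not fit in 24 bits.
std::optional<PosSuffix> EncodePosition(uint32_t block_seq, uint32_t tx_offset)
{
    if (block_seq > MAX_U24 || tx_offset > MAX_U24) return std::nullopt;
    return PosSuffix{
        static_cast<uint8_t>(block_seq >> 16),
        static_cast<uint8_t>(block_seq >> 8),
        static_cast<uint8_t>(block_seq),
        static_cast<uint8_t>(tx_offset >> 16),
        static_cast<uint8_t>(tx_offset >> 8),
        static_cast<uint8_t>(tx_offset)};
}

// Inverse of EncodePosition: returns {block_seq, tx_offset}.
std::pair<uint32_t, uint32_t> DecodePosition(const PosSuffix& s)
{
    const uint32_t block_seq{(uint32_t{s[0]} << 16) | (uint32_t{s[1]} << 8) | s[2]};
    const uint32_t tx_offset{(uint32_t{s[3]} << 16) | (uint32_t{s[4]} << 8) | s[5]};
    return {block_seq, tx_offset};
}
```

Three bytes comfortably cover any offset within a 4,000,000-byte block, and about 16.7 million connected blocks for the sequence.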

in src/index/txindex.cpp:118 in 9f99da1649 outdated

 116 | +        const txindex::DBKey key{txindex::CreateKeyPrefix(m_hasher, tx->GetHash()),
 117 | +                                 txindex::BlockTxPosition{block_seq, tx_offset_in_block}};
 118 | +        // The tx position is encoded in the key, so the value is intentionally
 119 | +        // empty. A 0-length byte array avoids the spurious '\0' that "" would store.
 120 | +        batch.Write(key, std::array<std::byte, 0>{});
 121 | +        tx_offset_in_block += ::GetSerializeSize(TX_WITH_WITNESS(*tx));

l0rinc commented at 11:34 PM on July 19, 2026:

9f99da1 txindex: hash key prefixes and pack block positions:

nit: we don't need the last one here - not sure if special-casing the last iteration is worth it:

        if (&tx != &block.data->vtx.back()) tx_offset_in_block += tx->ComputeTotalSize();

Even if the condition isn't needed, tx->ComputeTotalSize() seems more on point.
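As a standalone illustration of how the running offset is built up, here is a sketch that takes plain serialized sizes instead of real transactions. `CompactSizeLen` and `TxOffsets` are hypothetical helpers written for this example; only the 80-byte header size and the CompactSize length thresholds are taken from the Bitcoin serialization format.

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t BLOCK_HEADER_SIZE{80};  // serialized Bitcoin block header

// Number of bytes Bitcoin's CompactSize encoding uses for a count.
uint32_t CompactSizeLen(uint64_t n)
{
    if (n < 253) return 1;
    if (n <= 0xFFFF) return 3;
    if (n <= 0xFFFFFFFF) return 5;
    return 9;
}

// Given each transaction's serialized size, return each transaction's byte
// offset from the start of the serialized block.
std::vector<uint32_t> TxOffsets(const std::vector<uint32_t>& tx_sizes)
{
    std::vector<uint32_t> offsets;
    offsets.reserve(tx_sizes.size());
    uint32_t offset{BLOCK_HEADER_SIZE + CompactSizeLen(tx_sizes.size())};
    for (const uint32_t size : tx_sizes) {
        offsets.push_back(offset);
        offset += size;  // the addition after the last tx is never read
    }
    return offsets;
}
```

The unconditional `offset += size` mirrors the simpler form of the loop; special-casing the last iteration would only save one addition per block.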

in src/index/txindex.cpp:160 in 9f99da1649 outdated

 163 | +    struct Candidate {
 164 | +        const CBlockIndex* block_index;
 165 | +        uint32_t tx_offset_in_block;
 166 | +        bool active;
 167 | +    };
 168 | +    std::vector<Candidate> candidates;

l0rinc commented at 11:44 PM on July 19, 2026:

9f99da1 txindex: hash key prefixes and pack block positions:

nit: can we reserve the upper bound here?

    std::vector<Candidate> candidates;
    candidates.reserve(positions.size());

andrewtoth commented at 12:38 AM on July 26, 2026:

I'm not sure it's worth trying to optimize this. We expect to have 1 entry most of the time.

andrewtoth force-pushed on Jul 20, 2026

in src/test/txindex_tests.cpp:171 in 20a8aaef4e

 166 | +    // Read the last coinbase's encoded position straight from its bucket.
 167 | +    const auto fake_bucket{BucketPositions(db, fake_prefix)};
 168 | +    BOOST_REQUIRE_EQUAL(fake_bucket.size(), 1U);
 169 | +    const txindex::BlockTxPosition fake_pos{fake_bucket.front()};
 170 | +
 171 | +    db.Write(txindex::DBKey{target_prefix, fake_pos}, "");

l0rinc commented at 8:09 PM on July 20, 2026:

We're writing zero-length values now:

    db.Write(txindex::DBKey{target_prefix, fake_pos}, std::array<std::byte, 0>{});

in src/test/txindex_tests.cpp:237 in 20a8aaef4e

 232 | +    txindex.Stop();
 233 | +}
 234 | +
 235 | +BOOST_FIXTURE_TEST_CASE(txindex_reorg_keeps_stale_entries, TestChain100Setup)
 236 | +{
 237 | +    TxIndex txindex(interfaces::MakeChain(m_node), 1_MiB, true);

l0rinc commented at 8:10 PM on July 20, 2026:

nit: could we name these primitive arguments? In particular, whether this test deliberately uses an in-memory database isn't obvious at the call site.

    TxIndex txindex(interfaces::MakeChain(m_node), /*n_cache_size=*/1_MiB, /*f_memory=*/true);

in src/index/txindex.cpp:191 in 20a8aaef4e

 194 | +        const FlatFilePos tx_position{block_index->nFile, block_index->nDataPos + pos.tx_offset_in_block};
 195 | +        candidates.emplace_back(tx_position, block_hash, m_chainstate->m_chain.Contains(*block_index));
 196 | +    }
 197 | +
 198 | +    // Try candidates in the active chain first.
 199 | +    std::stable_partition(candidates.begin(), candidates.end(), [](const Candidate& c) { return c.in_active_chain; });

l0rinc commented at 8:10 PM on July 20, 2026:

Could we pass the candidate range directly to std::ranges::stable_partition, matching the range algorithms already used above?

    std::ranges::stable_partition(candidates, [](auto& c) { return c.in_active_chain; });

in src/test/txindex_tests.cpp:148 in 20a8aaef4e

 142 | @@ -66,4 +143,175 @@ BOOST_FIXTURE_TEST_CASE(txindex_initial_sync, TestChain100Setup)
 143 |      txindex.Stop();
 144 |  }
 145 |  
 146 | +BOOST_FIXTURE_TEST_CASE(txindex_collision_scan_path, TestChain100Setup)
 147 | +{
 148 | +    TxIndex txindex(interfaces::MakeChain(m_node), 1_MiB, true);

l0rinc commented at 4:56 AM on July 21, 2026:

This test opens txindex in memory: https://github.com/bitcoin/bitcoin/blob/20a8aaef4ee2c7f0830403bb92f1efa19fc99f5b/src/test/txindex_tests.cpp#L148

Consequently, !f_memory is false and HasKeyStartingWith() is skipped: https://github.com/bitcoin/bitcoin/blob/20a8aaef4ee2c7f0830403bb92f1efa19fc99f5b/src/index/txindex.cpp#L78

This initializes m_has_legacy to false: https://github.com/bitcoin/bitcoin/blob/20a8aaef4ee2c7f0830403bb92f1efa19fc99f5b/src/index/txindex.cpp#L92

The legacy fallback therefore returns early: https://github.com/bitcoin/bitcoin/blob/20a8aaef4ee2c7f0830403bb92f1efa19fc99f5b/src/index/txindex.cpp#L212

As shown by code coverage: <img width="636" height="184" alt="Image" src="https://github.com/user-attachments/assets/4bf1aaea-7e66-4c84-bb31-5f7a1be92836" />

Could we run this test on disk to also exercise the production path that determines whether legacy rows exist? Please check if it applies to other cases as well.

l0rinc commented at 9:31 PM on July 21, 2026:

    // A database created fresh by this version cannot contain legacy entries, so
    // lookups skip the legacy fallback: drop the last coinbase's hashed entry and
    // re-add it under the old 't' + txid schema (a physical CDiskTxPos), then
    // confirm the lookup misses even though the legacy row exists.

This is why I left the comment above: the code comment is misleading. The old entry isn't skipped because the database was freshly created, but because it is in-memory. As written, the comment implies we're validating "new on-disk db detection", while the test actually exercises the "in-memory, m_has_legacy=false" path.
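
To make the distinction concrete, here is a condensed sketch of the detection condition as it is described in this thread. `DetectHasLegacy` and its third parameter are hypothetical names; the third parameter stands in for the result of the on-disk `HasKeyStartingWith()` probe.

```cpp
#include <cassert>

// Hypothetical condensation of the constructor logic discussed above: legacy
// rows are only probed for when the database is on disk and not being wiped.
// `disk_has_legacy_rows` stands in for the HasKeyStartingWith() result.
constexpr bool DetectHasLegacy(bool f_memory, bool f_wipe, bool disk_has_legacy_rows)
{
    return !f_memory && !f_wipe && disk_has_legacy_rows;
}

// An in-memory database never reports legacy rows, whatever the probe would say.
static_assert(!DetectHasLegacy(/*f_memory=*/true, /*f_wipe=*/false, /*disk_has_legacy_rows=*/true));
// Only the on-disk, non-wiped case actually depends on the probe.
static_assert(DetectHasLegacy(/*f_memory=*/false, /*f_wipe=*/false, /*disk_has_legacy_rows=*/true));
```

Under this reading, a test that passes `f_memory=true` can never reach the branch that depends on the probe.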

in src/index/txindex.cpp:159 in 20a8aaef4e

 162 |  BaseIndex::DB& TxIndex::GetDB() const { return *m_db; }
 163 |  
 164 |  bool TxIndex::FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const
 165 |  {
 166 | +    const txindex::TxHashKeyPrefix prefix{txindex::CreateKeyPrefix(m_db->m_hasher, tx_hash)};
 167 | +    std::unique_ptr<CDBIterator> it{m_db->NewIterator()};

l0rinc commented at 5:32 AM on July 21, 2026:

Can we release the iterator before the slower candidate resolution and block file reads below?

std::vector<txindex::BlockTxPosition> positions;
{
    std::unique_ptr<CDBIterator> it{m_db->NewIterator()};
    it->Seek(std::pair{txindex::DB_TXINDEX_HASHED, prefix});
    txindex::DBKey key{prefix, {}};
    for (; it->Valid() && it->GetKey(key) && key.hash_prefix == prefix; it->Next()) {
        positions.emplace_back(key.pos);
    }
}

l0rinc changes_requested

in src/index/txindex.h:51 in 24dfbd081a

  47 | @@ -48,10 +48,10 @@ class TxIndex final : public BaseIndex
  48 |      /// Look up a transaction by hash.
  49 |      ///
  50 |      /// @param[in]   tx_hash  The hash of the transaction to be returned.
  51 | -    /// @param[out]  block_hash  The hash of the block the transaction is found in.
  52 | -    /// @param[out]  tx  The transaction itself.
  53 | +    /// @param[out]  block_hash  The hash of the block the transaction is found in. Undefined if false is returned.

l0rinc commented at 7:47 PM on July 21, 2026:

24dfbd0 txindex: make TxIndex::FindTx [[nodiscard]]:

nit: "undefined" has UB connotations, "unspecified" may be slightly better. Leaving the outputs unchanged on failure would be even better...

<details><summary>preserve failed FindTx outputs</summary>

diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
index 48391038b2..25b87394ba 100644
--- a/src/index/txindex.cpp
+++ b/src/index/txindex.cpp
@@ -194,19 +194,20 @@ bool TxIndex::FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef&
             LogError("OpenBlockFile failed");
             return false;
         }
+        CTransactionRef candidate_tx;
         try {
-            file >> TX_WITH_WITNESS(tx);
+            file >> TX_WITH_WITNESS(candidate_tx);
         } catch (const std::exception& e) {
             LogError("Deserialize or I/O error - %s", e.what());
             return false;
         }
-        if (tx->GetHash() == tx_hash) {
+        if (candidate_tx->GetHash() == tx_hash) {
+            tx = std::move(candidate_tx);
             block_hash = candidate.block_hash;
             return true;
         }
     }
 
-    tx.reset();
     if (!m_db->m_has_legacy) return false;
     // Fall back to legacy if no hashed entry matched. This makes misses pay an
     // extra lookup, but keeps existing full-txid entries readable after upgrade.
@@ -221,18 +222,20 @@ bool TxIndex::FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef&
         return false;
     }
     CBlockHeader header;
+    CTransactionRef candidate_tx;
     try {
         file >> header;
         file.seek(postx.nTxOffset, SEEK_CUR);
-        file >> TX_WITH_WITNESS(tx);
+        file >> TX_WITH_WITNESS(candidate_tx);
     } catch (const std::exception& e) {
         LogError("Deserialize or I/O error - %s", e.what());
         return false;
     }
-    if (tx->GetHash() != tx_hash) {
+    if (candidate_tx->GetHash() != tx_hash) {
         LogError("txid mismatch");
         return false;
     }
+    tx = std::move(candidate_tx);
     block_hash = header.GetHash();
     return true;
 }
diff --git a/src/index/txindex.h b/src/index/txindex.h
index 7011fb358a..22125ec079 100644
--- a/src/index/txindex.h
+++ b/src/index/txindex.h
@@ -52,8 +52,8 @@ public:
     /// Look up a transaction by hash.
     ///
     /// @param[in]   tx_hash  The hash of the transaction to be returned.
-    /// @param[out]  block_hash  The hash of the block the transaction is found in. Undefined if false is returned.
-    /// @param[out]  tx  The transaction itself. Undefined if false is returned.
+    /// @param[out]  block_hash  The hash of the block the transaction is found in. Unchanged if false is returned.
+    /// @param[out]  tx  The transaction itself. Unchanged if false is returned.
     /// @return  true if transaction is found, false otherwise
     [[nodiscard]] bool FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const;
 };

</details>
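
The "unchanged on failure" contract in the diff above boils down to working on a local and committing to the output parameter only after verification. A toy sketch of that pattern, with hypothetical names (`FindValue`, a string payload) rather than the real transaction types:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical toy lookup: "deserializes" a candidate into a local and commits
// it to the output parameter only once it has been verified.
bool FindValue(const std::string& wanted, const std::string& stored, std::shared_ptr<const std::string>& out)
{
    // Work on a local first so a mismatch cannot clobber the caller's value.
    auto candidate{std::make_shared<const std::string>(stored)};
    if (*candidate != wanted) return false; // `out` is left untouched
    out = std::move(candidate);
    return true;
}
```

With this shape no explicit reset of the output is needed on the failure paths, which is what lets the `tx.reset()` line disappear in the suggested diff.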

in src/index/txindex.cpp:63 in ef564e622e outdated

  59 | @@ -56,6 +60,21 @@ bool TxIndex::DB::ReadTxPos(const Txid& txid, CDiskTxPos& pos) const
  60 |      return Read(std::make_pair(DB_TXINDEX, txid.ToUint256()), pos);
  61 |  }
  62 |  
  63 | +CBlockLocator TxIndex::DB::ReadBestBlock() const

l0rinc commented at 7:59 PM on July 21, 2026:

ef564e6 txindex: use a new block locator for downgrade safety:

Given that this isn't read often and that we're planning to migrate other indexes similarly (e.g. txospenderindex), consider adding the V2 check to the non-virtual parent method. It makes sense to specialize WriteBestBlock, but I think the ReadBestBlock changes could move up into BaseIndex::DB:

<details><summary>centralize and test locator fallback</summary>

diff --git a/src/index/base.cpp b/src/index/base.cpp
index 5820448bb7..781b91921d 100644
--- a/src/index/base.cpp
+++ b/src/index/base.cpp
@@ -79,12 +79,9 @@ BaseIndex::DB::DB(const fs::path& path, size_t n_cache_size, bool f_memory, bool
 CBlockLocator BaseIndex::DB::ReadBestBlock() const
 {
     CBlockLocator locator;
-
-    bool success = Read(DB_BEST_BLOCK, locator);
-    if (!success) {
+    if (!Read(DB_BEST_BLOCK_V2, locator) && !Read(DB_BEST_BLOCK, locator)) {
         locator.SetNull();
     }
-
     return locator;
 }
 
diff --git a/src/index/base.h b/src/index/base.h
index 731bbd26ba..6915b93255 100644
--- a/src/index/base.h
+++ b/src/index/base.h
@@ -63,6 +63,9 @@ protected:
     */
     class DB : public CDBWrapper
     {
+    protected:
+        inline static constexpr std::string DB_BEST_BLOCK_V2{"best_block_v2"};
+
     public:
         DB(const fs::path& path, size_t n_cache_size,
            bool f_memory = false, bool f_wipe = false, bool f_obfuscate = false, bool f_bloom = true);
@@ -70,7 +73,7 @@ protected:
 
         /// Read block locator of the chain that the index is in sync with.
         /// Note, the returned locator will be empty if no record exists.
-        virtual CBlockLocator ReadBestBlock() const;
+        CBlockLocator ReadBestBlock() const;
 
         /// Write block locator of the chain that the index is in sync with.
         virtual void WriteBestBlock(CDBBatch& batch, const CBlockLocator& locator);
diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
index 6351dc6e23..0e6d3dad57 100644
--- a/src/index/txindex.cpp
+++ b/src/index/txindex.cpp
@@ -36,8 +36,6 @@
 #include <utility>
 #include <vector>
 
-const std::string DB_BEST_BLOCK_V2{"best_block_v2"};
-
 std::unique_ptr<TxIndex> g_txindex;
 
 
@@ -60,7 +58,6 @@ public:
     /// Whether the database contains any legacy ('t' + txid) entries.
     const bool m_has_legacy;
 
-    CBlockLocator ReadBestBlock() const override;
     void WriteBestBlock(CDBBatch& batch, const CBlockLocator& locator) override;
 
 private:
@@ -94,16 +91,6 @@ bool TxIndex::DB::ReadTxPos(const Txid& txid, CDiskTxPos& pos) const
     return Read(txindex::LegacyTxKey(txid), pos);
 }
 
-CBlockLocator TxIndex::DB::ReadBestBlock() const
-{
-    CBlockLocator locator;
-    if (Read(DB_BEST_BLOCK_V2, locator)) {
-        return locator;
-    }
-    // If we don't have a locator yet, start from the legacy best block.
-    return BaseIndex::DB::ReadBestBlock();
-}
-
 void TxIndex::DB::WriteBestBlock(CDBBatch& batch, const CBlockLocator& locator)
 {
     batch.Write(DB_BEST_BLOCK_V2, locator);
diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
index 46b6daae1b..da0fd02ca3 100644
--- a/src/test/txindex_tests.cpp
+++ b/src/test/txindex_tests.cpp
@@ -41,6 +41,14 @@ class TxIndexTest
 {
 public:
     static CDBWrapper& GetDB(const TxIndex& txindex) { return static_cast<CDBWrapper&>(txindex.GetDB()); }
+    static CBlockLocator ReadBestBlock(const TxIndex& txindex) { return txindex.GetDB().ReadBestBlock(); }
+    static void WriteBestBlock(const TxIndex& txindex, const CBlockLocator& locator)
+    {
+        auto& db{txindex.GetDB()};
+        CDBBatch batch{db};
+        db.WriteBestBlock(batch, locator);
+        db.WriteBatch(batch);
+    }
 };
 
 namespace {
@@ -143,6 +151,26 @@ BOOST_FIXTURE_TEST_CASE(txindex_initial_sync, TestChain100Setup)
     txindex.Stop();
 }
 
+BOOST_FIXTURE_TEST_CASE(txindex_locator_upgrade, TestChain100Setup)
+{
+    const CBlockLocator legacy_locator{{m_coinbase_txns.front()->GetHash().ToUint256()}};
+    const CBlockLocator versioned_locator{{m_coinbase_txns.back()->GetHash().ToUint256()}};
+    {
+        CDBWrapper db{DBParams{.path = gArgs.GetDataDirNet() / "indexes" / "txindex", .cache_bytes = 1_MiB}};
+        db.Write(uint8_t{'B'}, legacy_locator);
+    }
+
+    TxIndex txindex(interfaces::MakeChain(m_node), /*n_cache_size=*/1_MiB, /*f_memory=*/false);
+    BOOST_CHECK(TxIndexTest::ReadBestBlock(txindex).vHave == legacy_locator.vHave);
+
+    TxIndexTest::WriteBestBlock(txindex, versioned_locator);
+    BOOST_CHECK(TxIndexTest::ReadBestBlock(txindex).vHave == versioned_locator.vHave);
+
+    CBlockLocator stored_legacy_locator;
+    BOOST_REQUIRE(TxIndexTest::GetDB(txindex).Read(uint8_t{'B'}, stored_legacy_locator));
+    BOOST_CHECK(stored_legacy_locator.vHave == legacy_locator.vHave);
+}
+
 BOOST_FIXTURE_TEST_CASE(txindex_collision_scan_path, TestChain100Setup)
 {
     TxIndex txindex(interfaces::MakeChain(m_node), /*n_cache_size=*/1_MiB, /*f_memory=*/false);

(note that this may not apply cleanly to you since I have local suggestion commits already - but you get the picture)

</details>

nit: my IDE is warning here:

Destructor 'BaseIndex::DB::~DB()' hides a non-virtual function from class 'CDBWrapper'

It's likely a false-positive, just ignore if you don't think it's important.

andrewtoth commented at 12:40 AM on July 26, 2026:

Not sure about the block locator change. Does this mean all indexes will be updated to a new locator? Maybe we can reconsider this when we want to do a similar upgrade for other indexes?

Also, agree it's a false positive.

l0rinc commented at 12:06 AM on July 27, 2026:

We have to make sure downgrading is safe. Could we verify that ReadBestBlock() falls back to B, prefers best_block_v2 once it exists, and that WriteBestBlock() leaves B unchanged?

This pins this PR’s txindex-specific behavior without moving locator handling into BaseIndex (which I would still prefer, but that’s just a nit).

<details><summary>test txindex locator upgrade</summary>

diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
index a75779bcf2..ce375fc50d 100644
--- a/src/test/txindex_tests.cpp
+++ b/src/test/txindex_tests.cpp
@@ -42,7 +42,15 @@ BOOST_AUTO_TEST_SUITE(txindex_tests)
 class TxIndexTest
 {
 public:
-    static CDBWrapper& GetDB(const TxIndex& txindex) { return static_cast<CDBWrapper&>(txindex.GetDB()); }
+    static CDBWrapper& GetDB(const TxIndex& txindex) { return txindex.GetDB(); }
+    static CBlockLocator ReadBestBlock(const TxIndex& txindex) { return txindex.GetDB().ReadBestBlock(); }
+    static void WriteBestBlock(const TxIndex& txindex, const CBlockLocator& locator)
+    {
+        auto& db{txindex.GetDB()};
+        CDBBatch batch{db};
+        db.WriteBestBlock(batch, locator);
+        db.WriteBatch(batch);
+    }
 };

and

BOOST_FIXTURE_TEST_CASE(txindex_locator_upgrade, TestChain100Setup)
{
    uint256 legacy_hash, new_hash;
    {
        LOCK(cs_main);
        legacy_hash = Assert(m_node.chainman->ActiveChain()[1])->GetBlockHash();
        new_hash = Assert(m_node.chainman->ActiveChain().Tip())->GetBlockHash();
    }
    CBlockLocator legacy_locator{{legacy_hash}}, new_locator{{new_hash}};
    { CDBWrapper{DBParams{.path = gArgs.GetDataDirNet() / "indexes" / "txindex", .cache_bytes = 1_MiB}}.Write(uint8_t{'B'}, legacy_locator); }

    TxIndex txindex(interfaces::MakeChain(m_node), /*n_cache_size=*/1_MiB, /*f_memory=*/false);
    BOOST_CHECK(TxIndexTest::ReadBestBlock(txindex).vHave == legacy_locator.vHave);

    TxIndexTest::WriteBestBlock(txindex, new_locator);
    BOOST_CHECK(TxIndexTest::ReadBestBlock(txindex).vHave == new_locator.vHave);

    CBlockLocator stored_legacy_locator;
    BOOST_REQUIRE(TxIndexTest::GetDB(txindex).Read(uint8_t{'B'}, stored_legacy_locator));
    BOOST_CHECK(stored_legacy_locator.vHave == legacy_locator.vHave);
}

</details>
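
The three properties the test above pins down (fall back to `B`, prefer `best_block_v2`, never overwrite `B`) can be modeled with a toy key-value store. This is only a sketch: `ToyDb` and the `*Sketch` functions are hypothetical stand-ins, and locators are reduced to plain strings.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Toy key-value store standing in for the index database (an assumption for
// illustration; the real code goes through CDBWrapper).
using ToyDb = std::map<std::string, std::string>;

const std::string LEGACY_KEY{"B"};
const std::string V2_KEY{"best_block_v2"};

// Prefer the versioned locator and fall back to the legacy one.
std::optional<std::string> ReadBestBlockSketch(const ToyDb& db)
{
    if (const auto it{db.find(V2_KEY)}; it != db.end()) return it->second;
    if (const auto it{db.find(LEGACY_KEY)}; it != db.end()) return it->second;
    return std::nullopt;
}

// Only the versioned key is written, so an older binary still finds the
// legacy locator it last wrote.
void WriteBestBlockSketch(ToyDb& db, const std::string& locator)
{
    db[V2_KEY] = locator;
}
```

After a downgrade, the old binary would resume from the untouched legacy locator and re-index from there, which is the safety property being asked for.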

in src/index/txindex.cpp:42 in ce62a0f5ea

  37 | @@ -30,6 +38,8 @@
  38 |  
  39 |  constexpr uint8_t DB_TXINDEX{'t'};
  40 |  const std::string DB_BEST_BLOCK_V2{"best_block_v2"};
  41 | +static const std::string DB_TXID_HASH_SALT{"txid_hash_salt"};
  42 | +static const std::string DB_NEXT_BLOCK_SEQ{"next_block_seq"};

l0rinc commented at 8:04 PM on July 21, 2026:

ce62a0f txindex: hash key prefixes and pack block positions:

nit: these can be constexpr and should probably move to the header, preferably inside the existing txindex namespace so they are reusable in tests as well. We could also add a LegacyTxKey helper:

<details><summary>centralize txindex format keys</summary>

diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
index f0ea651803..a8133bdadf 100644
--- a/src/index/txindex.cpp
+++ b/src/index/txindex.cpp
@@ -36,10 +36,7 @@
 #include <utility>
 #include <vector>
 
-constexpr uint8_t DB_TXINDEX{'t'};
 const std::string DB_BEST_BLOCK_V2{"best_block_v2"};
-static const std::string DB_TXID_HASH_SALT{"txid_hash_salt"};
-static const std::string DB_NEXT_BLOCK_SEQ{"next_block_seq"};
 
 std::unique_ptr<TxIndex> g_txindex;
 
@@ -75,17 +72,17 @@ static fs::path TxIndexDBPath() { return gArgs.GetDataDirNet() / "indexes" / "tx
 TxIndex::DB::DB(size_t n_cache_size, bool f_memory, bool f_wipe) :
     // Enable bloom filters only if legacy entries are present (they are point lookups)
     DB(n_cache_size, f_memory, f_wipe, /*f_obfuscate=*/false,
-       /*f_bloom=*/!f_memory && !f_wipe && CDBWrapper::HasKeyStartingWith(TxIndexDBPath(), DB_TXINDEX))
+       /*f_bloom=*/!f_memory && !f_wipe && CDBWrapper::HasKeyStartingWith(TxIndexDBPath(), txindex::DB_TXINDEX))
 {}
 
 TxIndex::DB::DB(size_t n_cache_size, bool f_memory, bool f_wipe, bool f_obfuscate, bool f_bloom) :
     BaseIndex::DB(TxIndexDBPath(), n_cache_size, f_memory, f_wipe, f_obfuscate, f_bloom),
     m_hasher{[](CDBWrapper& db) {
         std::pair<uint64_t, uint64_t> salt;
-        if (!db.Read(DB_TXID_HASH_SALT, salt)) {
+        if (!db.Read(txindex::DB_TXID_HASH_SALT, salt)) {
             FastRandomContext rng{};
             salt = {rng.rand64(), rng.rand64()};
-            db.Write(DB_TXID_HASH_SALT, salt, /*fSync=*/true);
+            db.Write(txindex::DB_TXID_HASH_SALT, salt, /*fSync=*/true);
         }
         return PresaltedSipHasher{salt.first, salt.second};
     }(*this)},
@@ -94,7 +91,7 @@ TxIndex::DB::DB(size_t n_cache_size, bool f_memory, bool f_wipe, bool f_obfuscat
 
 bool TxIndex::DB::ReadTxPos(const Txid& txid, CDiskTxPos& pos) const
 {
-    return Read(std::make_pair(DB_TXINDEX, txid.ToUint256()), pos);
+    return Read(txindex::LegacyTxKey(txid), pos);
 }
 
 CBlockLocator TxIndex::DB::ReadBestBlock() const
@@ -117,12 +114,12 @@ void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block)
     if (Exists(txindex::BlockHashKey{block.hash})) return;
 
     uint32_t block_seq{0};
-    Read(DB_NEXT_BLOCK_SEQ, block_seq);
+    Read(txindex::DB_NEXT_BLOCK_SEQ, block_seq);
 
     CDBBatch batch(*this);
     batch.Write(txindex::BlockHashKey{block.hash}, block_seq);
     batch.Write(txindex::BlockSeqKey{block_seq}, block.hash);
-    batch.Write(DB_NEXT_BLOCK_SEQ, block_seq + 1);
+    batch.Write(txindex::DB_NEXT_BLOCK_SEQ, block_seq + 1);
     uint32_t tx_offset_in_block{static_cast<uint32_t>(GetSerializeSize(CBlockHeader{})) + GetSizeOfCompactSize(block.data->vtx.size())};
     for (const auto& tx : block.data->vtx) {
         const txindex::DBKey key{txindex::CreateKeyPrefix(m_hasher, tx->GetHash()),
diff --git a/src/index/txindex_key.h b/src/index/txindex_key.h
index 8cdae5fb7b..bebe2e8c2f 100644
--- a/src/index/txindex_key.h
+++ b/src/index/txindex_key.h
@@ -17,11 +17,15 @@
 #include <cstring>
 #include <ios>
 #include <string>
+#include <utility>
 
 namespace txindex {
+constexpr uint8_t DB_TXINDEX{'t'};
 constexpr uint8_t DB_TXINDEX_HASHED{'x'};
 constexpr uint8_t DB_BLOCK_SEQ{'s'};
 constexpr uint8_t DB_BLOCK_HASH{'h'};
+inline constexpr std::string DB_TXID_HASH_SALT{"txid_hash_salt"};
+inline constexpr std::string DB_NEXT_BLOCK_SEQ{"next_block_seq"};
 
 //! The location of a transaction: the sequence number of the block that contains it
 //! and the transaction's serialized byte offset from the start of that block
@@ -69,6 +73,12 @@ struct BlockHashKey {
     }
 };
 
+//! Key of a legacy (pre-hashing) txindex row: the full txid under the 't' prefix.
+inline std::pair<uint8_t, uint256> LegacyTxKey(const Txid& txid)
+{
+    return {DB_TXINDEX, txid.ToUint256()};
+}
+
 using TxHashKeyPrefix = std::array<std::byte, 5>;
 
 inline TxHashKeyPrefix CreateKeyPrefix(const PresaltedSipHasher& hasher, const Txid& txid)
@@ -84,8 +94,6 @@ struct DBKey {
     TxHashKeyPrefix hash_prefix;
     BlockTxPosition pos;
 
-    explicit DBKey(const TxHashKeyPrefix& hash_in, const BlockTxPosition& pos_in) : hash_prefix{hash_in}, pos{pos_in} {}
-
     SERIALIZE_METHODS(DBKey, obj)
     {
         uint8_t prefix{DB_TXINDEX_HASHED};
diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
index 605332e8e7..4bad878264 100644
--- a/src/test/txindex_tests.cpp
+++ b/src/test/txindex_tests.cpp
@@ -48,7 +48,7 @@ namespace {
 PresaltedSipHasher ReadHasher(const CDBWrapper& db)
 {
     std::pair<uint64_t, uint64_t> salt;
-    BOOST_REQUIRE(db.Read(std::string{"txid_hash_salt"}, salt));
+    BOOST_REQUIRE(db.Read(txindex::DB_TXID_HASH_SALT, salt));
     return PresaltedSipHasher{salt.first, salt.second};
 }
 
@@ -191,7 +191,7 @@ BOOST_FIXTURE_TEST_CASE(txindex_collision_scan_path, TestChain100Setup)
     // the legacy CDiskTxPos.nTxOffset is measured after the header.
     const CDiskTxPos fake_physical{BlockFilePos(*m_node.chainman, fake_pos.block_seq + 1), fake_pos.tx_offset_in_block - static_cast<uint32_t>(GetSerializeSize(CBlockHeader{}))};
     db.Erase(txindex::DBKey{fake_prefix, fake_pos});
-    db.Write(std::make_pair(static_cast<uint8_t>('t'), fake_txid.ToUint256()), fake_physical);
+    db.Write(txindex::LegacyTxKey(fake_txid), fake_physical);
     CTransactionRef legacy_tx;
     BOOST_CHECK(!txindex.FindTx(fake_txid, block_hash, legacy_tx));
 
@@ -208,7 +208,7 @@ BOOST_FIXTURE_TEST_CASE(txindex_legacy_fallback, TestChain100Setup)
     const CDiskTxPos legacy_pos{BlockFilePos(*m_node.chainman, 1), 1};
     {
         CDBWrapper db{DBParams{.path = gArgs.GetDataDirNet() / "indexes" / "txindex", .cache_bytes = 1_MiB}};
-        db.Write(std::make_pair(static_cast<uint8_t>('t'), legacy_txid.ToUint256()), legacy_pos);
+        db.Write(txindex::LegacyTxKey(legacy_txid), legacy_pos);
     }
 
     TxIndex txindex(interfaces::MakeChain(m_node), 1_MiB, /*f_memory=*/false);

</details>

in src/index/txindex.cpp:197 in ce62a0f5ea

 200 | +            return true;
 201 | +        }
 202 | +    }
 203 | +
 204 | +    tx.reset();
 205 | +    // Fall back to legacy if no hashed entry matched. This makes misses pay an

l0rinc commented at 8:11 PM on July 21, 2026:

ce62a0f txindex: hash key prefixes and pack block positions:

As mentioned before, this method is huge and doesn't help untangle the new and old behavior. We now have an unpaired read and write (the read cannot process what the write produces), and two completely independent algorithms are stuck in the same method. Could we split it along these lines?

bool TxIndex::FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const
{
    if (auto result{FindHashedTx(tx_hash, block_hash, tx)}) return *result;

    // Fall back to legacy if no hashed entry matched. This makes misses pay an
    // extra lookup, but keeps existing full-txid entries readable after upgrade.
    return m_db->m_has_legacy && FindLegacyTx(tx_hash, block_hash, tx);
}

And ReadTxPos doesn't even seem to be needed anymore; we could just inline it:

    if (!m_db->Read(std::make_pair(DB_TXINDEX, tx_hash.ToUint256()), postx)) {

instead.
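
The split suggested above relies on a tri-state result from the hashed lookup: a definite `true` or `false`, or "undecided", in which case the legacy path is tried. A self-contained sketch of that control flow, with hypothetical names (`HashedOutcome`, `FindHashedSketch`, `FindSketch`) and the actual lookups replaced by flags:

```cpp
#include <cassert>
#include <optional>

enum class HashedOutcome { Found, Error, NoMatch }; // hypothetical labels

// Hypothetical stand-in for the hashed lookup: a definite answer (true/false)
// or std::nullopt when no hashed entry matched.
std::optional<bool> FindHashedSketch(HashedOutcome outcome)
{
    switch (outcome) {
    case HashedOutcome::Found: return true;
    case HashedOutcome::Error: return false;
    case HashedOutcome::NoMatch: return std::nullopt;
    }
    return std::nullopt;
}

// Mirrors the suggested control flow: only an undecided hashed lookup reaches
// the legacy path, and only when legacy rows exist.
bool FindSketch(HashedOutcome outcome, bool has_legacy, bool legacy_hit)
{
    if (auto result{FindHashedSketch(outcome)}) return *result;
    return has_legacy && legacy_hit;
}
```

Note that a hard error in the hashed path returns `false` without consulting the legacy rows, which matches the early `return false` statements in the current code.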

in src/index/txindex.cpp:69 in ce62a0f5ea

  68 |  };
  69 |  
  70 |  TxIndex::DB::DB(size_t n_cache_size, bool f_memory, bool f_wipe) :
  71 | -    BaseIndex::DB(gArgs.GetDataDirNet() / "indexes" / "txindex", n_cache_size, f_memory, f_wipe)
  72 | +    BaseIndex::DB(gArgs.GetDataDirNet() / "indexes" / "txindex", n_cache_size, f_memory, f_wipe),
  73 | +    m_hasher{[](CDBWrapper& db) {

l0rinc commented at 8:14 PM on July 21, 2026:

ce62a0f txindex: hash key prefixes and pack block positions:

This immediately invoked lambda is a bit heavy here; can we extract it to a helper in an anonymous namespace?

namespace {
PresaltedSipHasher ReadOrCreateTxidHasher(CDBWrapper& db)
{
    std::pair<uint64_t, uint64_t> salt;
    if (!db.Read(DB_TXID_HASH_SALT, salt)) {
        FastRandomContext rng{};
        salt = {rng.rand64(), rng.rand64()};
        db.Write(DB_TXID_HASH_SALT, salt, /*fSync=*/true);
    }
    return PresaltedSipHasher{salt.first, salt.second};
}
} // namespace

and simplify construction to:

TxIndex::DB::DB(size_t n_cache_size, bool f_memory, bool f_wipe, bool has_legacy) :
    BaseIndex::DB(TxIndexDBPath(), n_cache_size, f_memory, f_wipe, /*f_obfuscate=*/false, /*f_bloom=*/has_legacy),
    m_hasher{ReadOrCreateTxidHasher(*this)},
    m_has_legacy{has_legacy}
{}

in src/index/txindex.cpp:102 in ce62a0f5ea outdated

  96 | @@ -75,11 +97,25 @@ void TxIndex::DB::WriteBestBlock(CDBBatch& batch, const CBlockLocator& locator)
  97 |      batch.Write(DB_BEST_BLOCK_V2, locator);
  98 |  }
  99 |  
 100 | -void TxIndex::DB::WriteTxs(const std::vector<std::pair<Txid, CDiskTxPos>>& v_pos)
 101 | +void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block)
 102 |  {
 103 | +    if (Exists(txindex::BlockHashKey{block.hash})) return;

l0rinc commented at 8:25 PM on July 21, 2026:

ce62a0f txindex: hash key prefixes and pack block positions:

What is this line meant to prevent exactly? Assigning multiple block sequences to the same hash? How can we even have that, manual disconnect/reconnect cycles? If so, this isn't trivial and could use a line comment.

    if (Exists(txindex::BlockHashKey{block.hash})) return; // Preserve its sequence when a block reconnects
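
If that reading is right, the guard makes sequence assignment idempotent per block hash. A toy model of the bookkeeping, assuming that interpretation (`SeqAllocator` and `Connect` are hypothetical names, and block hashes are plain strings here):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy model of the block-hash -> sequence bookkeeping (names are illustrative).
struct SeqAllocator {
    std::map<std::string, uint32_t> seq_by_hash;
    uint32_t next_seq{0};

    // Assign a sequence the first time a block is seen; on reconnect, keep the
    // existing one so positions written earlier still resolve to this block.
    uint32_t Connect(const std::string& block_hash)
    {
        if (const auto it{seq_by_hash.find(block_hash)}; it != seq_by_hash.end()) return it->second;
        seq_by_hash.emplace(block_hash, next_seq);
        return next_seq++;
    }
};
```

Without the early return, a reconnected block would consume a second sequence number and its transactions would be written twice under different positions.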

in src/index/txindex.cpp:110 in ce62a0f5ea

 108 |      CDBBatch batch(*this);
 109 | -    for (const auto& [txid, pos] : v_pos) {
 110 | -        batch.Write(std::make_pair(DB_TXINDEX, txid.ToUint256()), pos);
 111 | +    batch.Write(txindex::BlockHashKey{block.hash}, block_seq);
 112 | +    batch.Write(txindex::BlockSeqKey{block_seq}, block.hash);
 113 | +    batch.Write(DB_NEXT_BLOCK_SEQ, block_seq + 1);

l0rinc commented at 8:28 PM on July 21, 2026:

ce62a0f txindex: hash key prefixes and pack block positions:

nit: we could assert that block_seq doesn't overflow here, maybe with CheckedAdd. Just resolve if you disagree.

in src/index/txindex.cpp:111 in ce62a0f5ea

 109 | -    for (const auto& [txid, pos] : v_pos) {
 110 | -        batch.Write(std::make_pair(DB_TXINDEX, txid.ToUint256()), pos);
 111 | +    batch.Write(txindex::BlockHashKey{block.hash}, block_seq);
 112 | +    batch.Write(txindex::BlockSeqKey{block_seq}, block.hash);
 113 | +    batch.Write(DB_NEXT_BLOCK_SEQ, block_seq + 1);
 114 | +    uint32_t tx_offset_in_block{static_cast<uint32_t>(GetSerializeSize(CBlockHeader{})) + GetSizeOfCompactSize(block.data->vtx.size())};

l0rinc commented at 8:30 PM on July 21, 2026:

ce62a0f txindex: hash key prefixes and pack block positions:

I still think GetSerializeSize(CBlockHeader{}) shouldn't be recomputed; we should add a constant for it and also use it in the txindex_collision_scan_path test.
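
For illustration, a standalone sketch of the offset arithmetic with a named constant. `BLOCK_HEADER_SIZE`, `CompactSizeLen` and `FirstTxOffset` are hypothetical names chosen for this example; the 80-byte header layout and the CompactSize thresholds are the standard Bitcoin serialization rules.

```cpp
#include <cassert>
#include <cstdint>

// A serialized block header is 80 bytes: version (4) + previous block hash (32)
// + merkle root (32) + time (4) + bits (4) + nonce (4).
constexpr uint32_t BLOCK_HEADER_SIZE{4 + 32 + 32 + 4 + 4 + 4}; // hypothetical constant name
static_assert(BLOCK_HEADER_SIZE == 80);

// Number of bytes Bitcoin's CompactSize encoding uses for the value n.
constexpr uint32_t CompactSizeLen(uint64_t n)
{
    if (n < 253) return 1;
    if (n <= 0xFFFF) return 3;
    if (n <= 0xFFFFFFFF) return 5;
    return 9;
}

// Offset of the first transaction from the start of the serialized block:
// the header followed by the CompactSize-encoded transaction count.
constexpr uint32_t FirstTxOffset(uint64_t tx_count)
{
    return BLOCK_HEADER_SIZE + CompactSizeLen(tx_count);
}
```

A shared constant would also let the test subtract the header size when building the legacy `CDiskTxPos`, instead of recomputing it there.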

in src/index/txindex.cpp:100 in ce62a0f5ea outdated

  96 | @@ -75,11 +97,25 @@ void TxIndex::DB::WriteBestBlock(CDBBatch& batch, const CBlockLocator& locator)
  97 |      batch.Write(DB_BEST_BLOCK_V2, locator);
  98 |  }
  99 |  
 100 | -void TxIndex::DB::WriteTxs(const std::vector<std::pair<Txid, CDiskTxPos>>& v_pos)
 101 | +void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block)

l0rinc commented at 8:35 PM on July 21, 2026:

ce62a0f txindex: hash key prefixes and pack block positions:

To simplify review slightly, I'd do the void TxIndex::DB::WriteTxs(const std::vector<std::pair<Txid, CDiskTxPos>>& v_pos) to void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block) migration in a separate refactor commit.

in src/index/txindex.cpp:152 in ce62a0f5ea

 155 | +    std::vector<txindex::BlockTxPosition> positions;
 156 | +    for (; it->Valid() && it->GetKey(key) && key.hash_prefix == prefix; it->Next()) {
 157 | +        positions.emplace_back(key.pos);
 158 | +    }
 159 | +
 160 | +    // Lookup latest connected entries first.

l0rinc commented at 8:49 PM on July 21, 2026:

ce62a0f txindex: hash key prefixes and pack block positions:

Natural key order puts the oldest block sequence first, so in case of prefix collisions the lookup has to collect every position and reverse them to favor newer blocks, before building and partitioning another candidate vector.

Could we complement the stored sequence so LevelDB visits newer blocks first, scan the bucket directly, return an active-chain match immediately, and retain only the first stale match as fallback?
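
The ordering trick can be checked in isolation. The sketch below assumes a bytewise key comparison (LevelDB's default) and uses hypothetical helper names (`EncodeDescending`, `DecodeDescending`); it covers only the 3-byte sequence part of the key, not the full `DBKey`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr uint32_t SEQ_MASK{0xFFFFFF}; // largest value that fits in 3 bytes

// Encode a block sequence as 3 big-endian bytes after complementing it, so
// that a bytewise comparison visits larger (newer) sequences first.
constexpr std::array<uint8_t, 3> EncodeDescending(uint32_t block_seq)
{
    const uint32_t v{(block_seq ^ SEQ_MASK) & SEQ_MASK};
    return {static_cast<uint8_t>(v >> 16), static_cast<uint8_t>(v >> 8), static_cast<uint8_t>(v)};
}

// Inverse of EncodeDescending: rebuild the value and undo the complement.
constexpr uint32_t DecodeDescending(const std::array<uint8_t, 3>& b)
{
    const uint32_t v{(uint32_t{b[0]} << 16) | (uint32_t{b[1]} << 8) | uint32_t{b[2]}};
    return v ^ SEQ_MASK;
}
```

Sequence 0 encodes to `ff ff ff`, and a higher sequence always encodes to a lexicographically smaller byte string, so an iterator seeking to the prefix reaches the newest block first.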

<details><summary>stream newest txindex candidates</summary>

diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
index 5a3775e96e..f6cba63f7a 100644
--- a/src/index/txindex.cpp
+++ b/src/index/txindex.cpp
@@ -25,7 +25,6 @@
 #include <util/log.h>
 #include <validation.h>
 
-#include <algorithm>
 #include <array>
 #include <cassert>
 #include <cstddef>
@@ -34,7 +33,6 @@
 #include <exception>
 #include <string>
 #include <utility>
-#include <vector>
 
 std::unique_ptr<TxIndex> g_txindex;
 
@@ -138,42 +136,30 @@ std::optional<bool> TxIndex::FindHashedTx(const Txid& tx_hash, uint256& block_ha
     std::unique_ptr<CDBIterator> it{m_db->NewIterator()};
     it->Seek(std::pair{txindex::DB_TXINDEX_HASHED, prefix});
     txindex::DBKey key{prefix, {}};
-    std::vector<txindex::BlockTxPosition> positions;
+    CTransactionRef stale_tx;
+    uint256 stale_block_hash;
     for (; it->Valid() && it->GetKey(key) && key.hash_prefix == prefix; it->Next()) {
-        positions.emplace_back(key.pos);
-    }
-
-    // Lookup latest connected entries first.
-    std::ranges::reverse(positions);
-
-    struct Candidate {
-        FlatFilePos tx_position;
-        uint256 block_hash;
-        bool in_active_chain;
-    };
-    std::vector<Candidate> candidates;
-    for (const auto& pos : positions) {
-        uint256 block_hash;
-        if (!m_db->Read(txindex::BlockSeqKey{pos.block_seq}, block_hash)) {
-            LogError("Block sequence %u not found", pos.block_seq);
+        uint256 seq_block_hash;
+        if (!m_db->Read(txindex::BlockSeqKey{key.pos.block_seq}, seq_block_hash)) {
+            LogError("Block sequence %u not found", key.pos.block_seq);
             return false;
         }
-        LOCK(cs_main);
-        const CBlockIndex* block_index{m_chainstate->m_blockman.LookupBlockIndex(block_hash)};
-        if (!block_index) {
-            LogError("Block index entry %s not found", block_hash.ToString());
-            return false;
+        FlatFilePos tx_position;
+        bool in_active_chain;
+        {
+            LOCK(cs_main);
+            const CBlockIndex* block_index{m_chainstate->m_blockman.LookupBlockIndex(seq_block_hash)};
+            if (!block_index) {
+                LogError("Block index entry %s not found", seq_block_hash.ToString());
+                return false;
+            }
+            if (!(block_index->nStatus & BLOCK_HAVE_DATA)) continue;
+            tx_position = {block_index->nFile, block_index->nDataPos + key.pos.tx_offset_in_block};
+            in_active_chain = m_chainstate->m_chain.Contains(*block_index);
         }
-        if (!(block_index->nStatus & BLOCK_HAVE_DATA)) continue;
-        const FlatFilePos tx_position{block_index->nFile, block_index->nDataPos + pos.tx_offset_in_block};
-        candidates.emplace_back(tx_position, block_hash, m_chainstate->m_chain.Contains(*block_index));
-    }
-
-    // Try candidates in the active chain first.
-    std::stable_partition(candidates.begin(), candidates.end(), [](const Candidate& c) { return c.in_active_chain; });
+        if (!in_active_chain && stale_tx) continue;
 
-    for (const auto& candidate : candidates) {
-        AutoFile file{m_chainstate->m_blockman.OpenBlockFile(candidate.tx_position, /*fReadOnly=*/true)};
+        AutoFile file{m_chainstate->m_blockman.OpenBlockFile(tx_position, /*fReadOnly=*/true)};
         if (file.IsNull()) {
             LogError("OpenBlockFile failed");
             return false;
@@ -185,11 +171,21 @@ std::optional<bool> TxIndex::FindHashedTx(const Txid& tx_hash, uint256& block_ha
             LogError("Deserialize or I/O error - %s", e.what());
             return false;
         }
-        if (candidate_tx->GetHash() == tx_hash) {
+        if (candidate_tx->GetHash() != tx_hash) continue;
+
+        if (in_active_chain) {
             tx = std::move(candidate_tx);
-            block_hash = candidate.block_hash;
+            block_hash = seq_block_hash;
             return true;
         }
+        stale_tx = std::move(candidate_tx);
+        stale_block_hash = seq_block_hash;
+    }
+
+    if (stale_tx) {
+        tx = std::move(stale_tx);
+        block_hash = stale_block_hash;
+        return true;
     }
     return std::nullopt;
 }
diff --git a/src/index/txindex_key.h b/src/index/txindex_key.h
index 866aef549b..ed8827eff8 100644
--- a/src/index/txindex_key.h
+++ b/src/index/txindex_key.h
@@ -5,6 +5,7 @@
 #ifndef BITCOIN_INDEX_TXINDEX_KEY_H
 #define BITCOIN_INDEX_TXINDEX_KEY_H
 
+#include <consensus/consensus.h>
 #include <crypto/common.h>
 #include <crypto/siphash.h>
 #include <primitives/transaction.h>
@@ -31,18 +32,25 @@ inline constexpr std::string DB_NEXT_BLOCK_SEQ{"next_block_seq"};
 //! (including the header), so the on-disk position is simply
 //! block_data_pos + tx_offset_in_block.
 //!
-//! Both values are serialized as 3-byte big-endian integers, so entries sort by
-//! (block, offset) and each value must be below 16,777,216.
+//! Both values are serialized as 3-byte big-endian integers. The block sequence
+//! is complemented so entries sort by descending sequence, then ascending offset.
+//! Each value must be below 16,777,216.
 struct BlockTxPosition {
     uint32_t block_seq{0};
     uint32_t tx_offset_in_block{0};
 
     friend bool operator==(const BlockTxPosition&, const BlockTxPosition&) = default;
 
+    static constexpr int BLOCK_SEQ_SIZE{3}, TX_OFFSET_SIZE{3};
+    static constexpr uint32_t BLOCK_SEQ_MASK{BigEndianFormatter<BLOCK_SEQ_SIZE>::MAX};
+    static_assert(MAX_BLOCK_SERIALIZED_SIZE <= BigEndianFormatter<TX_OFFSET_SIZE>::MAX);
+
     SERIALIZE_METHODS(BlockTxPosition, obj)
     {
-        READWRITE(Using<BigEndianFormatter<3>>(obj.block_seq),
-                  Using<BigEndianFormatter<3>>(obj.tx_offset_in_block));
+        uint32_t ordered_block_seq{obj.block_seq ^ BLOCK_SEQ_MASK};
+        READWRITE(Using<BigEndianFormatter<BLOCK_SEQ_SIZE>>(ordered_block_seq),
+                  Using<BigEndianFormatter<TX_OFFSET_SIZE>>(obj.tx_offset_in_block));
+        SER_READ(obj, obj.block_seq = ordered_block_seq ^ BLOCK_SEQ_MASK);
     }
 };
 
@@ -55,7 +63,7 @@ struct BlockSeqKey {
         uint8_t prefix{DB_BLOCK_SEQ};
         READWRITE(prefix);
         if (ser_action.ForRead() && prefix != DB_BLOCK_SEQ) throw std::ios_base::failure("Invalid format for txindex block seq key");
-        READWRITE(Using<BigEndianFormatter<4>>(obj.block_seq));
+        READWRITE(Using<BigEndianFormatter<BlockTxPosition::BLOCK_SEQ_SIZE>>(obj.block_seq));
     }
 };
 
diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
index dd60aee311..bf840b15ea 100644
--- a/src/test/txindex_tests.cpp
+++ b/src/test/txindex_tests.cpp
@@ -102,10 +102,10 @@ void InvalidateBlock(ChainstateManager& chainman, const uint256& block_hash)
 BOOST_AUTO_TEST_CASE(txindex_position_encoding)
 {
     constexpr struct { txindex::BlockTxPosition position; std::string_view encoded; } test_vectors[]{
-        {{0, 0}, "000000000000"},
-        {{1, 2}, "000001000002"},
-        {{10'000'000, 123}, "98968000007b"},
-        {{456, 3'999'999}, "0001c83d08ff"},
+        {{0, 0}, "ffffff000000"},
+        {{1, 2}, "fffffe000002"},
+        {{10'000'000, 123}, "67697f00007b"},
+        {{456, 3'999'999}, "fffe373d08ff"},
     };
 
     for (const auto& [position, encoded] : test_vectors) {
@@ -115,6 +115,9 @@ BOOST_AUTO_TEST_CASE(txindex_position_encoding)
         BOOST_CHECK((DataStream{ParseHex(encoded)} >> decoded).empty());
         BOOST_CHECK(decoded == position);
     }
+
+    BOOST_CHECK_EQUAL(HexStr(DataStream{} << txindex::BlockSeqKey{1}), "73000001");
+    BOOST_CHECK_EQUAL(HexStr(DataStream{} << txindex::DBKey{{std::byte{1}, std::byte{2}, std::byte{3}, std::byte{4}, std::byte{5}}, {1, 2}}), "780102030405fffffe000002");
 }
 
 BOOST_AUTO_TEST_CASE(txindex_hash_prefix)
@@ -217,12 +220,12 @@ BOOST_FIXTURE_TEST_CASE(txindex_collision_scan_path, TestChain100Setup)
 
     db.Write(txindex::DBKey{target_prefix, fake_pos}, "");
 
-    // The target's bucket now holds the real target first (lower sequence
-    // number), then the forged false positive, which the descending scan tries first.
+    // The target's bucket now holds the forged false positive first (higher
+    // sequence number), then the real target.
     const auto target_bucket{BucketPositions(db, target_prefix)};
     BOOST_REQUIRE_EQUAL(target_bucket.size(), 2U);
-    BOOST_CHECK(target_bucket[0] != fake_pos);
-    BOOST_CHECK(target_bucket[1] == fake_pos);
+    BOOST_CHECK(target_bucket[0] == fake_pos);
+    BOOST_CHECK(target_bucket[1] != fake_pos);
 
     uint256 block_hash;
     LookupTx(txindex, target_txid);

</details>
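The complemented sequence encoding in the diff above can be tried in isolation. What follows is a minimal standalone sketch, not the PR's code: `Position`, `Encode`, `Decode` and `ToHex` are hypothetical names, and plain shifts stand in for `BigEndianFormatter`. It assumes only what the diff states, namely two 3-byte big-endian fields with the sequence XORed against `0xFFFFFF`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for txindex::BlockTxPosition (illustration only).
struct Position {
    uint32_t block_seq;
    uint32_t tx_offset_in_block;
};

constexpr uint32_t kMask24{0xFFFFFF}; // largest value that fits in 3 bytes

// Encode as 6 bytes: complemented sequence, then offset, both 3-byte big-endian.
std::array<uint8_t, 6> Encode(const Position& pos)
{
    assert(pos.block_seq <= kMask24 && pos.tx_offset_in_block <= kMask24);
    const uint32_t ordered_seq{pos.block_seq ^ kMask24};
    return {
        static_cast<uint8_t>(ordered_seq >> 16),
        static_cast<uint8_t>(ordered_seq >> 8),
        static_cast<uint8_t>(ordered_seq),
        static_cast<uint8_t>(pos.tx_offset_in_block >> 16),
        static_cast<uint8_t>(pos.tx_offset_in_block >> 8),
        static_cast<uint8_t>(pos.tx_offset_in_block),
    };
}

// Inverse of Encode: undo the complement on the sequence field.
Position Decode(const std::array<uint8_t, 6>& b)
{
    const uint32_t ordered_seq{(uint32_t{b[0]} << 16) | (uint32_t{b[1]} << 8) | uint32_t{b[2]}};
    const uint32_t offset{(uint32_t{b[3]} << 16) | (uint32_t{b[4]} << 8) | uint32_t{b[5]}};
    return {ordered_seq ^ kMask24, offset};
}

// Lowercase hex, matching the format of the test vectors.
std::string ToHex(const std::array<uint8_t, 6>& bytes)
{
    static constexpr char digits[] = "0123456789abcdef";
    std::string out;
    for (const uint8_t b : bytes) {
        out += digits[b >> 4];
        out += digits[b & 0xF];
    }
    return out;
}
```

Because the sequence is complemented before it is written, a bytewise-ordered scan of one bucket meets higher sequence numbers first: `Encode({2, 0})` compares lower than `Encode({1, 5})`, while two entries with the same sequence still sort by ascending offset.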

in src/index/txindex.cpp:170 in ce62a0f5ea

 173 | +        if (!it->Valid() || !it->GetKey(key) || key.block_seq != pos.block_seq || !it->GetValue(block_hash)) {
 174 | +            continue;
 175 | +        }
 176 | +        LOCK(cs_main);
 177 | +        const CBlockIndex* block_index{m_chainstate->m_blockman.LookupBlockIndex(block_hash)};
 178 | +        if (!block_index || !(block_index->nStatus & BLOCK_HAVE_DATA)) continue;

l0rinc commented at 8:55 PM on July 21, 2026:

ce62a0f txindex: hash key prefixes and pack block positions:

Same here: I don't think we should treat non-existent matches and pruned matches the same way. A missing block index looks like an index consistency failure, while "no block data" means "pruned or unavailable". Even if txindex isn't meant to run pruned today, I'd split those branches and either fail loudly or at least distinguish the errors.

        if (!block_index) {
            LogError("Block index entry %s not found", seq_block_hash.ToString());
            return false;
        }
        if (!(block_index->nStatus & BLOCK_HAVE_DATA)) continue;

in src/index/txindex.cpp:188 in ce62a0f5ea

 191 | +        }
 192 | +        try {
 193 | +            file >> TX_WITH_WITNESS(tx);
 194 | +        } catch (const std::exception& e) {
 195 | +            LogError("Deserialize or I/O error - %s", e.what());
 196 | +            return false;

l0rinc commented at 9:00 PM on July 21, 2026:

ce62a0f txindex: hash key prefixes and pack block positions:

I can't comment on the above response for some reason, but I'd reset tx before returning false here, or just not mutate the param in the first place. The contract says tx is unspecified on false, but that's still fragile. It's safer not to mutate tx until the full txid check passes, which lets us harden the guarantees: read into a local candidate_tx, compare the hash, and only then assign to tx.
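The "read into a local, verify, then assign" pattern can be sketched outside the codebase. This is only an illustration under stated assumptions: `Tx`, `TxRef` and `FindVerified` are hypothetical stand-ins for `CTransaction`, `CTransactionRef` and the lookup, and a `std::nullopt` entry models a deserialization failure.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real transaction types (illustration only).
struct Tx { std::string id; };
using TxRef = std::shared_ptr<const Tx>;

// Each element models one candidate read; std::nullopt models a read error.
bool FindVerified(const std::vector<std::optional<Tx>>& reads, const std::string& wanted, TxRef& tx)
{
    for (const auto& read_result : reads) {
        if (!read_result) return false; // error path: `tx` has not been touched
        TxRef candidate_tx{std::make_shared<const Tx>(*read_result)};
        if (candidate_tx->id != wanted) continue; // prefix collision, try the next one
        tx = std::move(candidate_tx); // the only write to the out-parameter
        assert(tx != nullptr);
        return true;
    }
    return false;
}
```

With this shape the caller's `tx` keeps its previous value on every `false` return, which is a stronger guarantee than "unspecified on false".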

in src/index/txindex_key.h:55 in ce62a0f5ea

  50 | +    SERIALIZE_METHODS(BlockSeqKey, obj)
  51 | +    {
  52 | +        uint8_t prefix{DB_BLOCK_SEQ};
  53 | +        READWRITE(prefix);
  54 | +        if (ser_action.ForRead() && prefix != DB_BLOCK_SEQ) throw std::ios_base::failure("Invalid format for txindex block seq key");
  55 | +        READWRITE(Using<BigEndianFormatter<4>>(obj.block_seq));

l0rinc commented at 9:04 PM on July 21, 2026:

ce62a0f txindex: hash key prefixes and pack block positions:

How come we're using 4 bytes here - compared to the 3 in BlockTxPosition? Could we unify and extract the constant? And maybe pin with some unit tests (assuming the new jumbo Siphash):

    BOOST_CHECK_EQUAL(HexStr(DataStream{} << txindex::BlockSeqKey{1}), "73000001");
    BOOST_CHECK_EQUAL(HexStr(DataStream{} << txindex::DBKey{{std::byte{1}, std::byte{2}, std::byte{3}, std::byte{4}, std::byte{5}}, {1, 2}}), "780102030405000001000002");
    BOOST_CHECK_EQUAL(
        HexStr(txindex::CreateKeyPrefix(
            SipHasher13UJ{0x0706050403020100ULL, 0x0F0E0D0C0B0A0908ULL},
            Txid{"1f1e1d1c1b1a191817161514131211100f0e0d0c0b0a09080706050403020100"})),
        "c67d87b08c");

in src/dbwrapper.cpp:165 in be3bf13c7a

 160 | +    leveldb::DB* raw_db;
 161 | +    if (!leveldb::DB::Open(options, fs::PathToString(path), &raw_db).ok()) return false;
 162 | +    const std::unique_ptr<leveldb::DB> db{raw_db};
 163 | +
 164 | +    const std::unique_ptr<leveldb::Iterator> it{db->NewIterator({})};
 165 | +    const leveldb::Slice prefix_slice{reinterpret_cast<const char*>(&prefix), 1};

l0rinc commented at 9:15 PM on July 21, 2026:

be3bf13 txindex: skip bloom filters and legacy lookups for new databases:

nit:

    const leveldb::Slice prefix_slice{reinterpret_cast<const char*>(&prefix), sizeof(prefix)};

in src/dbwrapper.cpp:161 in be3bf13c7a outdated

 152 | @@ -153,6 +153,20 @@ static leveldb::Options GetOptions(size_t nCacheSize, bool bloom_filter)
 153 |      return options;
 154 |  }
 155 |  
 156 | +bool CDBWrapper::HasKeyStartingWith(const fs::path& path, uint8_t prefix)
 157 | +{
 158 | +    leveldb::Options options;
 159 | +    options.paranoid_checks = true;
 160 | +    leveldb::DB* raw_db;
 161 | +    if (!leveldb::DB::Open(options, fs::PathToString(path), &raw_db).ok()) return false;

l0rinc commented at 9:22 PM on July 21, 2026:

be3bf13 txindex: skip bloom filters and legacy lookups for new databases:

I understand that it's expected to have empty databases, but that shouldn't happen here. Can we rather check open similarly to https://github.com/bitcoin/bitcoin/blob/d673ca765a5a05b0f16b3b6d62047deda092da9b/src/dbwrapper.cpp#L253-L254

HandleError(leveldb::DB::Open(options, fs::PathToString(path), &raw_db));

and guard general db availability with:

    if (!fs::exists(path / "CURRENT")) return false; // LevelDB's database-existence marker

We should probably add Seek validation:

    it->Seek(prefix_slice);
    HandleError(it->status());

andrewtoth commented at 12:43 AM on July 26, 2026:

Not sure we need this. If db is corrupt, when we try and open it after peeking it will fail anyway? Or am I misunderstanding this?

l0rinc commented at 10:40 PM on July 26, 2026:

HandleError for Open and Seek would be safer against corruption - and more consistent with other similar usages. I don't mind if you don't want to do it here.

But the other part is related to a separate side effect of the earlier prefix check: leveldb::DB::Open() is not read-only:

- on a missing path, it creates the database directory before discovering there is no CURRENT file
- on an existing database, its default logger creates or rotates LevelDB's LOG file

Demo (no need to commit):

BOOST_AUTO_TEST_CASE(dbwrapper_has_key_starting_with)
{
    const fs::path path{m_args.GetDataDirBase() / "dbwrapper_has_key_starting_with"};

    BOOST_CHECK(!CDBWrapper::HasKeyStartingWith(path, uint8_t{'t'}));
    BOOST_CHECK(!fs::exists(path)); // TODO
    { CDBWrapper{{.path = path, .cache_bytes = 1_MiB}}.Write(uint8_t{'t'}, uint8_t{1}); }
    BOOST_CHECK(CDBWrapper::HasKeyStartingWith(path, uint8_t{'t'}));
    BOOST_CHECK(!CDBWrapper::HasKeyStartingWith(path, uint8_t{'x'}));
    BOOST_CHECK(!fs::exists(path / "LOG")); // TODO
    BOOST_CHECK(!fs::exists(path / "LOG.old")); // TODO
}

The fix could be something like:

bool CDBWrapper::HasKeyStartingWith(const fs::path& path, uint8_t prefix)
{
    if (!fs::exists(path / "CURRENT")) return false;

    CBitcoinLevelDBLogger logger;
    leveldb::Options options;
    options.paranoid_checks = true;
    options.info_log = &logger;     // Avoid creating or rotating LevelDB's LOG files during this probe

    leveldb::DB* raw_db;
    HandleError(leveldb::DB::Open(options, fs::PathToString(path), &raw_db));
    std::unique_ptr<leveldb::DB> db{raw_db};

    leveldb::ReadOptions iteroptions;
    iteroptions.verify_checksums = true;
    iteroptions.fill_cache = false;

    std::unique_ptr<leveldb::Iterator> it{db->NewIterator(iteroptions)};
    leveldb::Slice prefix_slice{reinterpret_cast<const char*>(&prefix), sizeof(prefix)};
    it->Seek(prefix_slice);
    HandleError(it->status());

    return it->Valid() && it->key().starts_with(prefix_slice);
}

in src/test/txindex_tests.cpp:184 in 693dfa6adf

 179 | +
 180 | +    CTransactionRef tx_disk;
 181 | +    uint256 block_hash;
 182 | +    BOOST_REQUIRE(txindex.FindTx(target_txid, block_hash, tx_disk));
 183 | +    BOOST_REQUIRE(tx_disk);
 184 | +    BOOST_CHECK(tx_disk->GetHash() == target_txid);

l0rinc commented at 9:33 PM on July 21, 2026:

693dfa6 tests: cover txindex hash prefix collisions and legacy fallback:

The tests are quite a mouthful, maybe we can simplify them a bit with stuff like:

    BOOST_CHECK(Assert(tx_disk)->GetHash() == target_txid);

(note that we weren't requiring this after every FindTx call, this would also unify that)

<details><summary>simplify txindex result check</summary>

diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
index f1728b8d54..ec54524968 100644
--- a/src/test/txindex_tests.cpp
+++ b/src/test/txindex_tests.cpp
@@ -180,8 +180,7 @@ BOOST_FIXTURE_TEST_CASE(txindex_collision_scan_path, TestChain100Setup)
     CTransactionRef tx_disk;
     uint256 block_hash;
     BOOST_REQUIRE(txindex.FindTx(target_txid, block_hash, tx_disk));
-    BOOST_REQUIRE(tx_disk);
-    BOOST_CHECK(tx_disk->GetHash() == target_txid);
+    BOOST_CHECK(Assert(tx_disk)->GetHash() == target_txid);
 
     // A database created fresh by this version cannot contain legacy entries, so
     // lookups skip the legacy fallback: drop the last coinbase's hashed entry and
@@ -226,8 +225,7 @@ BOOST_FIXTURE_TEST_CASE(txindex_legacy_fallback, TestChain100Setup)
     CTransactionRef tx_disk;
     uint256 block_hash;
     BOOST_REQUIRE(txindex.FindTx(legacy_txid, block_hash, tx_disk));
-    BOOST_REQUIRE(tx_disk);
-    BOOST_CHECK(tx_disk->GetHash() == legacy_txid);
+    BOOST_CHECK(Assert(tx_disk)->GetHash() == legacy_txid);
 
     txindex.Stop();
 }
@@ -256,7 +254,7 @@ BOOST_FIXTURE_TEST_CASE(txindex_reorg_keeps_stale_entries, TestChain100Setup)
     CTransactionRef tx_disk;
     uint256 block_hash;
     BOOST_REQUIRE(txindex.FindTx(unique_txid, block_hash, tx_disk));
-    BOOST_CHECK(tx_disk->GetHash() == unique_txid);
+    BOOST_CHECK(Assert(tx_disk)->GetHash() == unique_txid);
     BOOST_CHECK(block_hash == stale_block_hash);
 
     CDBWrapper& db{TxIndexTest::GetDB(txindex)};
@@ -278,7 +276,7 @@ BOOST_FIXTURE_TEST_CASE(txindex_reorg_keeps_stale_entries, TestChain100Setup)
 
     // The disconnected transaction is still found, in the now-stale block.
     BOOST_REQUIRE(txindex.FindTx(unique_txid, block_hash, tx_disk));
-    BOOST_CHECK(tx_disk->GetHash() == unique_txid);
+    BOOST_CHECK(Assert(tx_disk)->GetHash() == unique_txid);
     BOOST_CHECK(block_hash == stale_block_hash);
     {
         LOCK(cs_main);
@@ -306,7 +304,7 @@ BOOST_FIXTURE_TEST_CASE(txindex_reorg_keeps_stale_entries, TestChain100Setup)
     // The transaction is found in the reconnected (again active) block, and its
     // bucket keeps the original position.
     BOOST_REQUIRE(txindex.FindTx(unique_txid, block_hash, tx_disk));
-    BOOST_CHECK(tx_disk->GetHash() == unique_txid);
+    BOOST_CHECK(Assert(tx_disk)->GetHash() == unique_txid);
     BOOST_CHECK(block_hash == stale_block_hash);
 
     BOOST_CHECK(BucketPositions(db, prefix) == original_bucket);

</details>

We could also extract some commonly used primitives here, like looking up a tx and invalidating a block.

<details><summary>share txindex test helpers</summary>

diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
index 213e0da54e..5965e3c72f 100644
--- a/src/test/txindex_tests.cpp
+++ b/src/test/txindex_tests.cpp
@@ -23,6 +23,7 @@
 #include <sync.h>
 #include <test/util/setup_common.h>
 #include <util/byte_units.h>
+#include <util/check.h>
 #include <util/strencodings.h>
 #include <validation.h>
 
@@ -79,6 +80,23 @@ FlatFilePos BlockFilePos(const ChainstateManager& chainman, uint32_t height)
     return {block_index->nFile, block_index->nDataPos};
 }
 
+uint256 LookupTx(const TxIndex& txindex, const Txid& txid)
+{
+    CTransactionRef tx;
+    uint256 block_hash;
+    BOOST_REQUIRE(txindex.FindTx(txid, block_hash, tx));
+    BOOST_CHECK(Assert(tx)->GetHash() == txid);
+    return block_hash;
+}
+
+void InvalidateBlock(ChainstateManager& chainman, const uint256& block_hash)
+{
+    CBlockIndex* block_index{WITH_LOCK(cs_main, return chainman.m_blockman.LookupBlockIndex(block_hash))};
+    BOOST_REQUIRE(block_index);
+    BlockValidationState state;
+    BOOST_REQUIRE(chainman.ActiveChainstate().InvalidateBlock(state, block_index));
+}
+
 } // namespace
 
 BOOST_AUTO_TEST_CASE(txindex_position_encoding)
@@ -137,11 +155,7 @@ BOOST_FIXTURE_TEST_CASE(txindex_initial_sync, TestChain100Setup)
 
     // Check that txindex has all txs that were in the chain before it started.
     for (const auto& txn : m_coinbase_txns) {
-        if (!txindex.FindTx(txn->GetHash(), block_hash, tx_disk)) {
-            BOOST_ERROR("FindTx failed");
-        } else if (tx_disk->GetHash() != txn->GetHash()) {
-            BOOST_ERROR("Read incorrect tx");
-        }
+        LookupTx(txindex, txn->GetHash());
     }
 
     // Check that new transactions in new blocks make it into the index.
@@ -152,11 +166,7 @@ BOOST_FIXTURE_TEST_CASE(txindex_initial_sync, TestChain100Setup)
         const CTransaction& txn = *block.vtx[0];
 
         BOOST_CHECK(txindex.BlockUntilSyncedToCurrentChain());
-        if (!txindex.FindTx(txn.GetHash(), block_hash, tx_disk)) {
-            BOOST_ERROR("FindTx failed");
-        } else if (tx_disk->GetHash() != txn.GetHash()) {
-            BOOST_ERROR("Read incorrect tx");
-        }
+        LookupTx(txindex, txn.GetHash());
     }
 
     // shutdown sequence (c.f. Shutdown() in init.cpp)
@@ -217,11 +227,8 @@ BOOST_FIXTURE_TEST_CASE(txindex_collision_scan_path, TestChain100Setup)
     BOOST_CHECK(target_bucket[0] == fake_pos);
     BOOST_CHECK(target_bucket[1] != fake_pos);
 
-    CTransactionRef tx_disk;
     uint256 block_hash;
-    BOOST_REQUIRE(txindex.FindTx(target_txid, block_hash, tx_disk));
-    BOOST_REQUIRE(tx_disk);
-    BOOST_CHECK(tx_disk->GetHash() == target_txid);
+    LookupTx(txindex, target_txid);
 
     // A database created fresh by this version cannot contain legacy entries, so
     // lookups skip the legacy fallback: drop the last coinbase's hashed entry and
@@ -263,11 +270,7 @@ BOOST_FIXTURE_TEST_CASE(txindex_legacy_fallback, TestChain100Setup)
     BOOST_REQUIRE(!bucket.empty());
     for (const auto& pos : bucket) db.Erase(txindex::DBKey{prefix, pos});
 
-    CTransactionRef tx_disk;
-    uint256 block_hash;
-    BOOST_REQUIRE(txindex.FindTx(legacy_txid, block_hash, tx_disk));
-    BOOST_REQUIRE(tx_disk);
-    BOOST_CHECK(tx_disk->GetHash() == legacy_txid);
+    LookupTx(txindex, legacy_txid);
 
     txindex.Stop();
 }
@@ -293,11 +296,7 @@ BOOST_FIXTURE_TEST_CASE(txindex_reorg_keeps_stale_entries, TestChain100Setup)
     const uint256 stale_block_hash{CreateAndProcessBlock({unique_mtx}, coinbase_script).GetHash()};
     BOOST_REQUIRE(txindex.BlockUntilSyncedToCurrentChain());
 
-    CTransactionRef tx_disk;
-    uint256 block_hash;
-    BOOST_REQUIRE(txindex.FindTx(unique_txid, block_hash, tx_disk));
-    BOOST_CHECK(tx_disk->GetHash() == unique_txid);
-    BOOST_CHECK(block_hash == stale_block_hash);
+    BOOST_CHECK(LookupTx(txindex, unique_txid) == stale_block_hash);
 
     CDBWrapper& db{TxIndexTest::GetDB(txindex)};
     const auto prefix{txindex::CreateKeyPrefix(ReadHasher(db), unique_txid)};
@@ -307,22 +306,16 @@ BOOST_FIXTURE_TEST_CASE(txindex_reorg_keeps_stale_entries, TestChain100Setup)
     ChainstateManager& chainman{*m_node.chainman};
 
     // Invalidate the block holding the unique transaction, then mine a longer branch.
-    {
-        CBlockIndex* tip{WITH_LOCK(cs_main, return chainman.ActiveChain().Tip())};
-        BlockValidationState state;
-        BOOST_REQUIRE(chainman.ActiveChainstate().InvalidateBlock(state, tip));
-    }
+    InvalidateBlock(chainman, stale_block_hash);
     const uint256 branch_block_hash{CreateAndProcessBlock({}, coinbase_script).GetHash()};
     CreateAndProcessBlock({}, coinbase_script);
     BOOST_REQUIRE(txindex.BlockUntilSyncedToCurrentChain());
 
     // The disconnected transaction is still found, in the now-stale block.
-    BOOST_REQUIRE(txindex.FindTx(unique_txid, block_hash, tx_disk));
-    BOOST_CHECK(tx_disk->GetHash() == unique_txid);
-    BOOST_CHECK(block_hash == stale_block_hash);
+    BOOST_CHECK(LookupTx(txindex, unique_txid) == stale_block_hash);
     {
         LOCK(cs_main);
-        const CBlockIndex* stale_index{chainman.m_blockman.LookupBlockIndex(block_hash)};
+        const CBlockIndex* stale_index{chainman.m_blockman.LookupBlockIndex(stale_block_hash)};
         BOOST_REQUIRE(stale_index);
         BOOST_CHECK(!chainman.ActiveChain().Contains(*stale_index));
     }
@@ -334,20 +327,15 @@ BOOST_FIXTURE_TEST_CASE(txindex_reorg_keeps_stale_entries, TestChain100Setup)
         LOCK(cs_main);
         chainman.ActiveChainstate().ResetBlockFailureFlags(chainman.m_blockman.LookupBlockIndex(stale_block_hash));
     }
-    {
-        CBlockIndex* branch_index{WITH_LOCK(cs_main, return chainman.m_blockman.LookupBlockIndex(branch_block_hash))};
-        BlockValidationState state;
-        BOOST_REQUIRE(chainman.ActiveChainstate().InvalidateBlock(state, branch_index));
-        BOOST_REQUIRE(chainman.ActiveChainstate().ActivateBestChain(state));
-    }
+    InvalidateBlock(chainman, branch_block_hash);
+    BlockValidationState state;
+    BOOST_REQUIRE(chainman.ActiveChainstate().ActivateBestChain(state));
     BOOST_REQUIRE(txindex.BlockUntilSyncedToCurrentChain());
     BOOST_CHECK(WITH_LOCK(cs_main, return chainman.ActiveChain().Tip()->GetBlockHash()) == stale_block_hash);
 
     // The transaction is found in the reconnected (again active) block, and its
     // bucket keeps the original position.
-    BOOST_REQUIRE(txindex.FindTx(unique_txid, block_hash, tx_disk));
-    BOOST_CHECK(tx_disk->GetHash() == unique_txid);
-    BOOST_CHECK(block_hash == stale_block_hash);
+    BOOST_CHECK(LookupTx(txindex, unique_txid) == stale_block_hash);
 
     BOOST_CHECK(BucketPositions(db, prefix) == original_bucket);

</details>

in src/test/txindex_tests.cpp:235 in 693dfa6adf outdated

 230 | +    BOOST_CHECK(tx_disk->GetHash() == legacy_txid);
 231 | +
 232 | +    txindex.Stop();
 233 | +}
 234 | +
 235 | +BOOST_FIXTURE_TEST_CASE(txindex_reorg_keeps_stale_entries, TestChain100Setup)

l0rinc commented at 9:39 PM on July 21, 2026:

693dfa6 tests: cover txindex hash prefix collisions and legacy fallback:

What happens when a legacy txindex entry gets reorged to the new version? My understanding is that after an upgrade, reconnecting a legacy-only block will add v2 entries and retain both formats for the same txid. Maybe we can avoid that...

<details><summary>test legacy txindex reconnects</summary>

BOOST_FIXTURE_TEST_CASE(txindex_legacy_reconnect, TestChain100Setup)
{
    ChainstateManager& chainman{*m_node.chainman};
    const uint256 block_hash{WITH_LOCK(cs_main, return chainman.ActiveChain().Tip()->GetBlockHash())};
    const Txid legacy_txid{m_coinbase_txns.back()->GetHash()};
    const CDiskTxPos legacy_pos{BlockFilePos(chainman, 100), 1};
    {
        CDBWrapper db{DBParams{.path = gArgs.GetDataDirNet() / "indexes" / "txindex", .cache_bytes = 1_MiB}};
        db.Write(txindex::LegacyTxKey(legacy_txid), legacy_pos);
        db.Write(uint8_t{'B'}, CBlockLocator{{block_hash}});
    }

    TxIndex txindex(interfaces::MakeChain(m_node), /*n_cache_size=*/1_MiB, /*f_memory=*/false);
    BOOST_REQUIRE(txindex.Init());
    txindex.Sync();

    CDBWrapper& db{TxIndexTest::GetDB(txindex)};
    const auto prefix{txindex::CreateKeyPrefix(ReadHasher(db), legacy_txid)};
    BOOST_CHECK(BucketPositions(db, prefix).empty());
    BOOST_CHECK(LookupTx(txindex, legacy_txid) == block_hash);

    InvalidateBlock(chainman, block_hash);
    BOOST_REQUIRE(txindex.BlockUntilSyncedToCurrentChain());
    {
        LOCK(cs_main);
        chainman.ActiveChainstate().ResetBlockFailureFlags(chainman.m_blockman.LookupBlockIndex(block_hash));
        chainman.RecalculateBestHeader();
    }
    BlockValidationState state;
    BOOST_REQUIRE(chainman.ActiveChainstate().ActivateBestChain(state));
    m_node.validation_signals->SyncWithValidationInterfaceQueue();
    BOOST_REQUIRE(txindex.BlockUntilSyncedToCurrentChain());

    BOOST_CHECK(WITH_LOCK(cs_main, return chainman.ActiveChain().Tip()->GetBlockHash()) == block_hash);
    BOOST_CHECK_EQUAL(BucketPositions(db, prefix).size(), 1U);
    BOOST_CHECK(db.Exists(txindex::LegacyTxKey(legacy_txid)));
    BOOST_CHECK(LookupTx(txindex, legacy_txid) == block_hash);

    txindex.Stop();
}

</details>

andrewtoth commented at 12:36 AM on July 26, 2026:

Not sure it's worth the review burden to add a test for this or to try to avoid the double writes. This seems to me like a very unlikely edge case, and it will just leave the db with an extra block's worth of index entries.

in doc/release-notes-35531.md:4 in 20a8aaef4e

   0 | @@ -0,0 +1,9 @@
   1 | +## Index
   2 | +
   3 | +- The transaction index (`-txindex`) now stores less data on disk, more than
   4 | +  halving the size of a fully rebuilt index. The index is backwards compatible,

l0rinc commented at 9:45 PM on July 21, 2026:

20a8aae doc: add release notes for txindex disk usage and stale block lookups:

"more than halving" sounds a bit awkward, since the new size is actually "less than half" of the original. Maybe:

A fully rebuilt transaction index (-txindex) now uses less than half as much disk space. The index is backwards compatible,

--

Also, the PR description says "direct file position of the transaction", but what is actually stored is the transaction's offset within the block, not its offset in the block file.

in src/index/txindex.cpp:77 in 20a8aaef4e

  76 | +static fs::path TxIndexDBPath() { return gArgs.GetDataDirNet() / "indexes" / "txindex"; }
  77 | +
  78 |  TxIndex::DB::DB(size_t n_cache_size, bool f_memory, bool f_wipe) :
  79 | -    BaseIndex::DB(gArgs.GetDataDirNet() / "indexes" / "txindex", n_cache_size, f_memory, f_wipe)
  80 | +    // Enable bloom filters only if legacy entries are present (they are point lookups)
  81 | +    DB(n_cache_size, f_memory, f_wipe, /*f_obfuscate=*/false,

l0rinc commented at 9:54 PM on July 21, 2026:

nit: having bloom be defined by whether it's an in-memory database is a bit confusing

Since we're not actually using f_obfuscate, we could add has_legacy instead of bool f_obfuscate, bool f_bloom to make it higher level.

in src/index/txindex.cpp:197 in 20a8aaef4e

 200 | +
 201 | +    for (const auto& candidate : candidates) {
 202 | +        AutoFile file{m_chainstate->m_blockman.OpenBlockFile(candidate.tx_position, /*fReadOnly=*/true)};
 203 | +        if (file.IsNull()) {
 204 | +            LogError("OpenBlockFile failed");
 205 | +            return false;

l0rinc commented at 10:06 PM on July 21, 2026:

What if a false-positive hash-prefix candidate is still marked BLOCK_HAVE_DATA, but its file cannot be opened or deserialized? We currently return immediately instead of trying later candidates or the legacy fallback. Could we continue after logging and add a test with an unreadable false positive before the real candidate? My understanding is this isn't currently the case for pruned blocks, so I'm not sure I fully understand the checks in the first place...

<details><summary>continue txindex candidate scans</summary>

diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
--- a/src/index/txindex.cpp	(revision 1792525e1bddb276baccd0f233012057133b03ef)
+++ b/src/index/txindex.cpp	(revision f4cca85dd92ef5e4c7875a4954b6cf70c2f8d6d8)
@@ -161,15 +161,15 @@
 
         AutoFile file{m_chainstate->m_blockman.OpenBlockFile(tx_position, /*fReadOnly=*/true)};
         if (file.IsNull()) {
-            LogError("OpenBlockFile failed");
-            return false;
+            LogWarning("OpenBlockFile failed");
+            continue;
         }
         CTransactionRef candidate_tx;
         try {
             file >> TX_WITH_WITNESS(candidate_tx);
         } catch (const std::exception& e) {
-            LogError("Deserialize or I/O error - %s", e.what());
-            return false;
+            LogWarning("Deserialize or I/O error - %s", e.what());
+            continue;
         }
         if (candidate_tx->GetHash() != tx_hash) continue;
 
diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
--- a/src/test/txindex_tests.cpp	(revision 1792525e1bddb276baccd0f233012057133b03ef)
+++ b/src/test/txindex_tests.cpp	(revision f4cca85dd92ef5e4c7875a4954b6cf70c2f8d6d8)
@@ -27,6 +27,7 @@
 #include <util/strencodings.h>
 #include <validation.h>
 
+#include <array>
 #include <cstdint>
 #include <string>
 #include <string_view>
@@ -217,15 +218,21 @@
     const auto fake_bucket{BucketPositions(db, fake_prefix)};
     BOOST_REQUIRE_EQUAL(fake_bucket.size(), 1U);
     const txindex::BlockTxPosition fake_pos{fake_bucket.front()};
+    const txindex::BlockTxPosition unreadable_pos{
+        fake_pos.block_seq,
+        static_cast<uint32_t>(BigEndianFormatter<txindex::BlockTxPosition::TX_OFFSET_SIZE>::MAX),
+    };
 
-    db.Write(txindex::DBKey{target_prefix, fake_pos}, "");
+    db.Write(txindex::DBKey{target_prefix, fake_pos}, std::array<std::byte, 0>{});
+    db.Write(txindex::DBKey{target_prefix, unreadable_pos}, std::array<std::byte, 0>{});
 
-    // The target's bucket now holds the forged false positive first (higher
+    // The target's bucket now holds the forged false positives first (higher
     // sequence number), then the real target.
     const auto target_bucket{BucketPositions(db, target_prefix)};
-    BOOST_REQUIRE_EQUAL(target_bucket.size(), 2U);
+    BOOST_REQUIRE_EQUAL(target_bucket.size(), 3U);
     BOOST_CHECK(target_bucket[0] == fake_pos);
-    BOOST_CHECK(target_bucket[1] != fake_pos);
+    BOOST_CHECK(target_bucket[1] == unreadable_pos);
+    BOOST_CHECK(target_bucket[2] != fake_pos);
 
     LookupTx(txindex, target_txid);

</details>
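The control flow proposed in that diff (log and keep scanning instead of returning on the first unreadable candidate) can be modelled in a few lines. This is a hedged sketch, not the PR's implementation: `Candidate`, `ScanCandidates` and the `warnings` vector are hypothetical, with `readable == false` standing in for a failed `OpenBlockFile` or a deserialization error.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical model of one hash-prefix bucket entry (illustration only).
struct Candidate {
    bool readable;    // false models a failed file open or deserialization
    std::string txid; // txid of the transaction stored at this position
};

// Returns the index of the first readable candidate whose txid matches.
// std::nullopt means "not found here"; a caller could then try a fallback.
std::optional<std::size_t> ScanCandidates(const std::vector<Candidate>& bucket,
                                          const std::string& wanted,
                                          std::vector<std::string>& warnings)
{
    assert(!wanted.empty());
    for (std::size_t i{0}; i < bucket.size(); ++i) {
        if (!bucket[i].readable) {
            // Record the problem, but do not give up on later candidates.
            warnings.push_back("candidate " + std::to_string(i) + " unreadable");
            continue;
        }
        if (bucket[i].txid != wanted) continue; // prefix collision
        return i;
    }
    return std::nullopt;
}
```

The trade-off is that a genuinely corrupt entry now produces a warning and a miss rather than a hard error, so the sketch collects warnings to keep the problem visible.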

in src/dbwrapper.cpp:164 in 20a8aaef4e

 159 | +    options.paranoid_checks = true;
 160 | +    leveldb::DB* raw_db;
 161 | +    if (!leveldb::DB::Open(options, fs::PathToString(path), &raw_db).ok()) return false;
 162 | +    const std::unique_ptr<leveldb::DB> db{raw_db};
 163 | +
 164 | +    const std::unique_ptr<leveldb::Iterator> it{db->NewIterator({})};

l0rinc commented at 10:09 PM on July 21, 2026:

do we also need verify_checksums = true here? maybe with fill_cache = false?

in src/index/txindex.cpp:168 in 20a8aaef4e

 171 | +    for (; it->Valid() && it->GetKey(key) && key.hash_prefix == prefix; it->Next()) {
 172 | +        positions.emplace_back(key.pos);
 173 | +    }
 174 | +
 175 | +    // Lookup latest connected entries first.
 176 | +    std::ranges::reverse(positions);

l0rinc commented at 10:20 PM on July 21, 2026:

if you decide to keep the positions, we could avoid the std::ranges::reverse by adding the uint32_t block_seq to the candidate and sorting instead of the partition:

<details><summary>make txindex candidate order explicit</summary>

diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
--- a/src/index/txindex.cpp	(revision 04d233bdda39c231d994686e40064b3df5b5e58d)
+++ b/src/index/txindex.cpp	(revision daf1554f7e058d42d33f1227b3c541fb5dfa561a)
@@ -32,6 +32,7 @@
 #include <cstdint>
 #include <cstdio>
 #include <exception>
+#include <functional>
 #include <string>
 #include <utility>
 #include <vector>
@@ -158,12 +159,10 @@
         }
     }

-    // Lookup latest connected entries first.
-    std::ranges::reverse(positions);
-
     struct Candidate {
         FlatFilePos tx_position;
         uint256 block_hash;
+        uint32_t block_seq;
         bool in_active_chain;
     };
     std::vector<Candidate> candidates;
@@ -181,11 +180,13 @@
         }
         if (!(block_index->nStatus & BLOCK_HAVE_DATA)) continue;
         const FlatFilePos tx_position{block_index->nFile, block_index->nDataPos + pos.tx_offset_in_block};
-        candidates.emplace_back(tx_position, seq_block_hash, m_chainstate->m_chain.Contains(*block_index));
+        candidates.emplace_back(tx_position, seq_block_hash, pos.block_seq, m_chainstate->m_chain.Contains(*block_index));
     }

-    // Try candidates in the active chain first.
-    std::stable_partition(candidates.begin(), candidates.end(), [](const Candidate& c) { return c.in_active_chain; });
+    // Prefer active-chain matches, then later-connected blocks.
+    std::ranges::sort(candidates, std::greater{}, [](const Candidate& candidate) {
+        return std::pair{candidate.in_active_chain, candidate.block_seq};
+    });

     for (const auto& candidate : candidates) {
         AutoFile file{m_chainstate->m_blockman.OpenBlockFile(candidate.tx_position, /*fReadOnly=*/true)};

</details>

This is an alternative to storing the inverse of the seq.

achow101 referenced this in commit 32eb521002 on Jul 21, 2026

in src/index/txindex.cpp:61 in 20a8aaef4e

  59 | -    void WriteTxs(const std::vector<std::pair<Txid, CDiskTxPos>>& v_pos);
  60 | +    /// Write a block of transaction positions to the DB.
  61 | +    void WriteTxs(const interfaces::BlockInfo& block);
  62 | +
  63 | +    /// Used to hash the txid to compute the prefix.
  64 | +    const PresaltedSipHasher m_hasher;

l0rinc commented at 10:45 PM on July 21, 2026:

Now that #35215 is merged we can upgrade this one as well \:D/

in src/index/txindex.cpp:168 in ce62a0f5ea outdated

 171 | +        txindex::BlockSeqKey key{};
 172 | +        uint256 block_hash;
 173 | +        if (!it->Valid() || !it->GetKey(key) || key.block_seq != pos.block_seq || !it->GetValue(block_hash)) {
 174 | +            continue;
 175 | +        }
 176 | +        LOCK(cs_main);

l0rinc commented at 11:11 PM on July 21, 2026:

ce62a0f txindex: hash key prefixes and pack block positions:

Could we resolve all sequence mappings first, then snapshot candidate state under one lock? This avoids repeated lock acquisitions and ensures active-chain classification comes from one chain state:

<details><summary>snapshot txindex candidate state</summary>

diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
--- a/src/index/txindex.cpp	(revision df6dbad05c8f39b858c8ba60116cc18af4d32814)
+++ b/src/index/txindex.cpp	(revision 506e6b88e96625b90bfdd5523ceb96efba212d14)
@@ -165,22 +165,29 @@
         uint32_t block_seq;
         bool in_active_chain;
     };
-    std::vector<Candidate> candidates;
+    std::vector<std::pair<txindex::BlockTxPosition, uint256>> mapped_positions;
     for (const auto& pos : positions) {
         uint256 seq_block_hash;
         if (!m_db->Read(txindex::BlockSeqKey{pos.block_seq}, seq_block_hash)) {
             LogError("Block sequence %u not found", pos.block_seq);
             return false;
         }
+        mapped_positions.emplace_back(pos, seq_block_hash);
+    }
+
+    std::vector<Candidate> candidates;
+    {
         LOCK(cs_main);
-        const CBlockIndex* block_index{m_chainstate->m_blockman.LookupBlockIndex(seq_block_hash)};
-        if (!block_index) {
-            LogError("Block index entry %s not found", seq_block_hash.ToString());
-            return false;
-        }
-        if (!(block_index->nStatus & BLOCK_HAVE_DATA)) continue;
-        const FlatFilePos tx_position{block_index->nFile, block_index->nDataPos + pos.tx_offset_in_block};
-        candidates.emplace_back(tx_position, seq_block_hash, pos.block_seq, m_chainstate->m_chain.Contains(*block_index));
+        for (auto& [pos, seq_block_hash] : mapped_positions) {
+            const CBlockIndex* block_index{m_chainstate->m_blockman.LookupBlockIndex(seq_block_hash)};
+            if (!block_index) {
+                LogError("Block index entry %s not found for txid %s", seq_block_hash.ToString(), tx_hash.ToString());
+                return false;
+            }
+            if (!(block_index->nStatus & BLOCK_HAVE_DATA)) continue;
+            const FlatFilePos tx_position{block_index->nFile, block_index->nDataPos + pos.tx_offset_in_block};
+            candidates.emplace_back(tx_position, seq_block_hash, pos.block_seq, m_chainstate->m_chain.Contains(*block_index));
+        }
     }
 
     // Prefer active-chain matches, then later-connected blocks.

</details>
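
The two-phase shape proposed above (resolve every sequence mapping first, then classify all candidates under one lock) can be sketched without any Bitcoin Core types. Everything here is an assumption made for illustration: `ChainView` and its `std::mutex` play the role of the chain state and `cs_main`, the maps replace the database and block index, and the `BLOCK_HAVE_DATA` skip path is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the chain state guarded by cs_main.
struct ChainView {
    std::mutex mutex;                      // plays the role of cs_main
    std::map<std::string, bool> is_active; // block hash -> in active chain
};

struct Candidate {
    uint32_t block_seq;
    std::string block_hash;
    bool in_active_chain;
};

// Phase 1 resolves seq -> hash without holding the lock. Phase 2 classifies
// every candidate under a single lock acquisition, so all classifications
// come from the same chain state. Returns nullopt if any mapping is missing.
std::optional<std::vector<Candidate>> SnapshotCandidates(
    const std::vector<uint32_t>& seqs,
    const std::map<uint32_t, std::string>& seq_to_hash,
    ChainView& chain)
{
    std::vector<std::pair<uint32_t, std::string>> mapped;
    for (const uint32_t seq : seqs) {
        const auto it = seq_to_hash.find(seq);
        if (it == seq_to_hash.end()) return std::nullopt;
        mapped.emplace_back(seq, it->second);
    }

    std::vector<Candidate> candidates;
    {
        std::lock_guard<std::mutex> lock(chain.mutex);
        for (const auto& [seq, hash] : mapped) {
            const auto it = chain.is_active.find(hash);
            if (it == chain.is_active.end()) return std::nullopt;
            candidates.push_back(Candidate{seq, hash, it->second});
        }
    }
    // This sketch has no skip path, so every mapped position becomes a candidate.
    assert(candidates.size() == mapped.size());
    return candidates;
}
```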

in src/test/txindex_tests.cpp:269 in 693dfa6adf

 264 | +    const auto original_bucket{BucketPositions(db, prefix)};
 265 | +    BOOST_REQUIRE_EQUAL(original_bucket.size(), 1U);
 266 | +
 267 | +    ChainstateManager& chainman{*m_node.chainman};
 268 | +
 269 | +    // Invalidate the block holding the unique transaction, then mine a longer branch.

l0rinc commented at 1:13 AM on July 22, 2026:

A tx can reference multiple blocks, and sequence order can disagree with active-chain preference after a reorg.

nit: could we cover selecting a later active occurrence, then the earlier occurrence after it becomes active again?

<details><summary>test txindex candidate priority</summary>

diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
--- a/src/test/txindex_tests.cpp	(revision 8af5ac219a5fc649a490e7d5ca3d4a2844ca745c)
+++ b/src/test/txindex_tests.cpp	(revision 6dcaee0d89fe6ffc0ac85b97896b161c0fc0461c)
@@ -305,10 +305,8 @@
 
     ChainstateManager& chainman{*m_node.chainman};
 
-    // Invalidate the block holding the unique transaction, then mine a longer branch.
+    // Invalidate the block holding the unique transaction.
     InvalidateBlock(chainman, stale_block_hash);
-    const uint256 branch_block_hash{CreateAndProcessBlock({}, coinbase_script).GetHash()};
-    CreateAndProcessBlock({}, coinbase_script);
     BOOST_REQUIRE(txindex.BlockUntilSyncedToCurrentChain());
 
     // The disconnected transaction is still found, in the now-stale block.
@@ -320,9 +318,15 @@
         BOOST_CHECK(!chainman.ActiveChain().Contains(*stale_index));
     }
 
+    // Mine the same transaction in a later-sequenced replacement branch.
+    const uint256 branch_block_hash{CreateAndProcessBlock({unique_mtx}, CScript() << OP_TRUE).GetHash()};
+    CreateAndProcessBlock({}, coinbase_script);
+    BOOST_REQUIRE(txindex.BlockUntilSyncedToCurrentChain());
+    BOOST_CHECK(LookupTx(txindex, unique_txid) == branch_block_hash);
+
     // Reorg back to the original branch by reconsidering the stale block and
-    // invalidating the replacement branch. Reconnecting the block must not
-    // create duplicate entries, since it keeps its original sequence number.
+    // invalidating the replacement branch. The active block must be preferred
+    // even though the replacement has a later sequence.
     {
         LOCK(cs_main);
         chainman.ActiveChainstate().ResetBlockFailureFlags(chainman.m_blockman.LookupBlockIndex(stale_block_hash));
@@ -337,7 +341,9 @@
     // bucket keeps the original position.
     BOOST_CHECK(LookupTx(txindex, unique_txid) == stale_block_hash);
 
-    BOOST_CHECK(BucketPositions(db, prefix) == original_bucket);
+    const auto reorg_bucket{BucketPositions(db, prefix)};
+    BOOST_REQUIRE_EQUAL(reorg_bucket.size(), 2U);
+    BOOST_CHECK(reorg_bucket.back() == original_bucket.front());
 
     txindex.Stop();
 }

</details>

in src/index/txindex.cpp:167 in ce62a0f5ea

 170 | +        it->Seek(txindex::BlockSeqKey{pos.block_seq});
 171 | +        txindex::BlockSeqKey key{};
 172 | +        uint256 block_hash;
 173 | +        if (!it->Valid() || !it->GetKey(key) || key.block_seq != pos.block_seq || !it->GetValue(block_hash)) {
 174 | +            continue;
 175 | +        }

l0rinc commented at 1:19 AM on July 22, 2026:

ce62a0f txindex: hash key prefixes and pack block positions:

Could we use the database point-read API here instead of reconstructing an exact lookup with the iterator?

Also, the block sequence to hash mapping is written in the same batch as the tx keys. So, if a txindex key parses but its block_seq can't be resolved, that doesn't sound like something we should just skip over. Is this a best-effort corruption skip? It's not obvious to me.

        uint256 block_hash;
        if (!m_db->Read(txindex::BlockSeqKey{pos.block_seq}, block_hash)) {
            LogError("Block sequence %u not found for txid %s", pos.block_seq, tx_hash.ToString());
            return false;
        }
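
To illustrate the difference being asked about, here is a toy comparison of the two lookup styles. It assumes an ordered `std::map` as a stand-in for the `['s', block seq] -> block hash` rows; `SeqTable`, `SeekExact` and `PointRead` are hypothetical names, not the `CDBWrapper` API.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Stand-in for the ['s', block seq] -> block hash rows of an ordered key-value store.
using SeqTable = std::map<uint32_t, std::string>;

// Iterator-style exact lookup: seek to the first key >= seq, then check that
// the key found is the one that was asked for.
std::optional<std::string> SeekExact(const SeqTable& table, uint32_t seq)
{
    const auto it = table.lower_bound(seq);
    assert(it == table.end() || it->first >= seq); // seek never lands before the target
    if (it == table.end() || it->first != seq) return std::nullopt;
    return it->second;
}

// Point read: ask for the exact key directly.
std::optional<std::string> PointRead(const SeqTable& table, uint32_t seq)
{
    const auto it = table.find(seq);
    if (it == table.end()) return std::nullopt;
    return it->second;
}
```

Both return the same answer for hits and misses; the point read just states the intent without the extra validity and key-equality checks.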

in src/test/txindex_tests.cpp:194 in 20a8aaef4e

 189 | +    // confirm the lookup misses even though the legacy row exists.
 190 | +    // BlockTxPosition offsets are from the block start (header included), while
 191 | +    // the legacy CDiskTxPos.nTxOffset is measured after the header.
 192 | +    const CDiskTxPos fake_physical{BlockFilePos(*m_node.chainman, fake_pos.block_seq + 1), fake_pos.tx_offset_in_block - static_cast<uint32_t>(GetSerializeSize(CBlockHeader{}))};
 193 | +    db.Erase(txindex::DBKey{fake_prefix, fake_pos});
 194 | +    db.Write(std::make_pair(static_cast<uint8_t>('t'), fake_txid.ToUint256()), fake_physical);

l0rinc commented at 2:24 AM on July 22, 2026:

we could also extract a txindex::LegacyTxKey helper to simplify many of these to:

    db.Write(txindex::LegacyTxKey(fake_txid), fake_physical);

in src/dbwrapper.h:282 in 20a8aaef4e

 278 | @@ -279,6 +279,9 @@ class CDBWrapper
 279 |       */
 280 |      bool IsEmpty();
 281 |  
 282 | +    //! Return true if a database at path exists and contains at least 1 entry beginning with prefix.

l0rinc commented at 2:25 AM on July 22, 2026:

we could document what should happen in case of empty database:

    //! Probe an unopened database for a key prefix, empty databases return false.

l0rinc changes_requested

l0rinc commented at 3:28 AM on July 22, 2026: contributor

I took another line-by-line pass and found several places where the implementation and its corner cases could be easier to follow. I left additional comments and coverage suggestions, then pushed a reference version incorporating many of them at https://github.com/l0rinc/bitcoin/pull/244 (not all, please read my comments as well). It separates the lookup paths, streams collisions newest-first, simplifies duplicate handling, and expands the upgrade, reorg, and failure coverage.

sedited added this to the milestone 32.0 on Jul 22, 2026

arejula27 referenced this in commit dc27621ec9 on Jul 22, 2026

andrewtoth force-pushed on Jul 25, 2026

andrewtoth commented at 12:45 AM on July 26, 2026: contributor

Thanks for the detailed review @l0rinc! Rebased with your siphash upgrade and took most of your suggestions. The refactors and extra test coverage are clear improvements. Declined to take a few suggestions where I commented with my reasoning.

txindex: make TxIndex::FindTx [[nodiscard]]

Co-authored-by: l0rinc <pap.lorinc@gmail.com>

bd2834b47a

txindex: use a new block locator for downgrade safety

The hashed txindex entries cannot be found by older nodes. Record sync
progress under a new locator key so a downgraded node will not rely on
entries indexed by upgraded nodes, and instead continue syncing from the
legacy locator.

c0b000f4c0

txindex: pass the full block to DB::WriteTxs

Move the per-transaction position computation from CustomAppend into
DB::WriteTxs, so the DB layer receives the whole block instead of a
pre-built vector of positions. This is a non-functional refactor.

10e3838ed6

andrewtoth force-pushed on Jul 26, 2026

DrahtBot added the label CI failed on Jul 26, 2026

DrahtBot removed the label CI failed on Jul 26, 2026

in src/test/txindex_tests.cpp:45 in c47816c9e1

  40 |  
  41 | +// Grants tests access to the otherwise non-public txindex database handle.
  42 | +class TxIndexTest
  43 | +{
  44 | +public:
  45 | +    static CDBWrapper& GetDB(const TxIndex& txindex) { return static_cast<CDBWrapper&>(txindex.GetDB()); }

l0rinc commented at 10:45 PM on July 26, 2026:

c47816c tests: cover txindex hash prefix collisions and legacy fallback:

    static CDBWrapper& GetDB(const TxIndex& txindex) { return txindex.GetDB(); }

in doc/release-notes-35531.md:9 in 7493f5c18c outdated

   0 | @@ -0,0 +1,9 @@
   1 | +## Index
   2 | +
   3 | +- The transaction index (`-txindex`) now stores less data on disk; a fully
   4 | +  rebuilt index takes less than half the space. The index is backwards compatible,
   5 | +  so existing users will not see the space saving unless the index is recreated.
   6 | +  To do so, stop the node, delete the `<datadir>/indexes/txindex` directory, and
   7 | +  restart; rebuilding can take up to a few hours depending on hardware. Once
   8 | +  rebuilt, the index can no longer be read by previous releases, so downgrading
   9 | +  will require rebuilding it again. (#35531)

l0rinc commented at 10:46 PM on July 26, 2026:

7493f5c doc: add release notes for txindex disk usage and stale block lookups:

It will also require deleting again:

  will require deleting and rebuilding it again. (#35531)

andrewtoth commented at 11:50 AM on July 27, 2026:

Downgrading should not require deleting. It will pick up from genesis since there's no legacy locator written.

l0rinc commented at 3:38 PM on July 27, 2026:

My understanding is that the older release rebuilds full-txid rows in the same database without erasing the new format, so the new format would just linger around - shouldn't we document that we'd likely want to delete the new one before switching back?

andrewtoth commented at 9:09 PM on July 27, 2026:

I think if users are switching back and forth, they would not want to delete the database at all and keep the legacy index. They will have some duplicate entries for every block that is connected while running the new version. I'm not sure how to document that exactly.

l0rinc commented at 9:39 PM on July 27, 2026:

Is there an actual use case for this? Or is it just a theoretical one, that if they discover something in the new version they don't like, they should be able to use a previous one?

andrewtoth commented at 11:03 PM on July 29, 2026:

I think we should support graceful downgrading when we can. They might discover something they don't like, like a bug in the new version that affects one of their use cases. Or they might experiment with different versions, or have both versions installed on their system and mistakenly start the older version at a later time. These are just some examples I can come up with.

Kino1994 referenced this in commit 221eaed650 on Jul 26, 2026

in src/index/txindex.cpp:82 in 49f4f5127f

  79 | +static fs::path TxIndexDBPath() { return gArgs.GetDataDirNet() / "indexes" / "txindex"; }
  80 | +
  81 |  TxIndex::DB::DB(size_t n_cache_size, bool f_memory, bool f_wipe) :
  82 | -    BaseIndex::DB(gArgs.GetDataDirNet() / "indexes" / "txindex", n_cache_size, f_memory, f_wipe),
  83 | -    m_hasher{ReadOrCreateTxidHasher(*this)}
  84 | +    // Legacy entries are the only point lookups, so they alone benefit from bloom filters.

l0rinc commented at 12:39 AM on July 27, 2026:

49f4f51 txindex: skip bloom filters and legacy lookups for new databases:

The claim that legacy entries are the only point lookups no longer holds. FindHashedTx reads ['s', block_seq] for each candidate it resolves, and WriteTxs probes ['h', block_hash] when a block may already be indexed - both are point lookups that a bloom filter would accelerate.

andrewtoth commented at 3:17 PM on July 27, 2026:

Expanded the comment explaining why we don't want bloom filters if we have no legacy entries. The sequence entries are tiny (<1 million 3 byte entries) which will likely fit into one lookup block, so there won't be any misses helped by bloom filters. Also, bloom filters would have to be added for everything, which is wasted on the vast majority of txid entries. So not worth adding.

l0rinc commented at 6:21 PM on July 27, 2026:

Thanks, please resolve. I wonder if we should fine-tune the bits per index - likely not something we want to do here.

in src/index/txindex_key.h:25 in 5bdfc8017a

  20 | +#include <string>
  21 | +#include <utility>
  22 | +
  23 | +namespace txindex {
  24 | +//! Prefix of a legacy (pre-hashing) txindex row.
  25 | +constexpr uint8_t DB_TXINDEX{'t'};

l0rinc commented at 12:46 AM on July 27, 2026:

5bdfc80 txindex: hash key prefixes and pack block positions:

We could document the database layout since it's getting a bit crowded here (and move the legacy prefix later, since it's not the important one anymore). And now that we keep repeating the empty value, we might as well extract that too:

namespace txindex {
/*
 * Database layout:
 *
 *   ['x', hash prefix, block seq, tx offset] -> (empty)
 *   ['s', block seq]                         -> block hash
 *   ['h', block hash]                        -> block seq
 *   ["next_block_seq"]                       -> next block seq to assign
 *   ["txid_hash_salt"]                       -> txid hasher salt
 *   ["best_block_v2"]                        -> current sync locator
 *   ['t', txid]                              -> legacy CDiskTxPos
 *   ['B']                                    -> legacy sync locator
 */
constexpr uint8_t DB_TXINDEX_HASHED{'x'};
constexpr uint8_t DB_BLOCK_SEQ{'s'};
constexpr uint8_t DB_BLOCK_HASH{'h'};
//! Prefix of a legacy (pre-hashing) txindex row.
constexpr uint8_t DB_TXINDEX{'t'};
inline const std::string DB_TXID_HASH_SALT{"txid_hash_salt"};
inline const std::string DB_NEXT_BLOCK_SEQ{"next_block_seq"};
inline const std::string DB_BEST_BLOCK_V2{"best_block_v2"};

//! Empty value of a hashed txindex row, whose position is encoded in its key.
inline constexpr std::array<std::byte, 0> EMPTY_VALUE{};
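
As a rough illustration of the first two rows of that layout, the sketch below hex-encodes the keys by hand. It assumes big-endian fields of 5 bytes (hash prefix), 3 bytes (block seq) and 3 bytes (tx offset), which is what the pinned encodings in `txindex_position_encoding` suggest; `AppendBE`, `BlockSeqKeyHex` and `HashedTxKeyHex` are hypothetical helpers, not the real serializers.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append the low `bytes` bytes of `value` to `out` as big-endian hex.
void AppendBE(std::string& out, uint64_t value, int bytes)
{
    assert(bytes >= 1 && bytes <= 8);
    static const char* digits = "0123456789abcdef";
    for (int i = bytes - 1; i >= 0; --i) {
        const unsigned byte = static_cast<unsigned>((value >> (8 * i)) & 0xff);
        out.push_back(digits[byte >> 4]);
        out.push_back(digits[byte & 0xf]);
    }
}

// ['s', block seq]: type byte followed by a 3-byte big-endian sequence number.
std::string BlockSeqKeyHex(uint32_t block_seq)
{
    std::string out;
    AppendBE(out, 's', 1);
    AppendBE(out, block_seq, 3);
    return out;
}

// ['x', hash prefix, block seq, tx offset]: type byte, then 5 + 3 + 3 bytes.
std::string HashedTxKeyHex(uint64_t hash_prefix, uint32_t block_seq, uint32_t tx_offset)
{
    std::string out;
    AppendBE(out, 'x', 1);
    AppendBE(out, hash_prefix, 5);
    AppendBE(out, block_seq, 3);
    AppendBE(out, tx_offset, 3);
    return out;
}
```

With big-endian fields, bytewise key order matches numeric order, so all rows sharing a hash prefix sit next to each other and are ordered by block sequence.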

<details><summary>centralize txindex schema definitions</summary>

diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
index 1b80d0fea8..db9278e137 100644
--- a/src/index/txindex.cpp
+++ b/src/index/txindex.cpp
@@ -26,9 +26,7 @@
 #include <validation.h>
 
 #include <algorithm>
-#include <array>
 #include <cassert>
-#include <cstddef>
 #include <cstdint>
 #include <cstdio>
 #include <exception>
@@ -37,8 +35,6 @@
 #include <utility>
 #include <vector>
 
-const std::string DB_BEST_BLOCK_V2{"best_block_v2"};
-
 std::unique_ptr<TxIndex> g_txindex;
 
 namespace {
@@ -99,7 +95,7 @@ TxIndex::DB::DB(size_t n_cache_size, bool f_memory, bool f_wipe, bool has_legacy
 CBlockLocator TxIndex::DB::ReadBestBlock() const
 {
     CBlockLocator locator;
-    if (Read(DB_BEST_BLOCK_V2, locator)) {
+    if (Read(txindex::DB_BEST_BLOCK_V2, locator)) {
         return locator;
     }
     // If we don't have a locator yet, start from the legacy best block.
@@ -108,7 +104,7 @@ CBlockLocator TxIndex::DB::ReadBestBlock() const
 
 void TxIndex::DB::WriteBestBlock(CDBBatch& batch, const CBlockLocator& locator)
 {
-    batch.Write(DB_BEST_BLOCK_V2, locator);
+    batch.Write(txindex::DB_BEST_BLOCK_V2, locator);
 }
 
 void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block)
@@ -125,9 +121,7 @@ void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block)
     for (const auto& tx : block.data->vtx) {
         const txindex::DBKey key{txindex::CreateKeyPrefix(m_hasher, tx->GetHash()),
                                  txindex::BlockTxPosition{block_seq, tx_offset_in_block}};
-        // The tx position is encoded in the key, so the value is intentionally
-        // empty. A 0-length byte array avoids the spurious '\0' that "" would store.
-        batch.Write(key, std::array<std::byte, 0>{});
+        batch.Write(key, txindex::EMPTY_VALUE);
         tx_offset_in_block += tx->ComputeTotalSize();
     }
     WriteBatch(batch);
diff --git a/src/index/txindex_key.h b/src/index/txindex_key.h
index f91f8ed7dd..efcb0ac8b2 100644
--- a/src/index/txindex_key.h
+++ b/src/index/txindex_key.h
@@ -11,19 +11,37 @@
 #include <serialize.h>
 #include <uint256.h>
 
+#include <array>
+#include <cstddef>
 #include <cstdint>
 #include <ios>
 #include <string>
 #include <utility>
 
 namespace txindex {
-//! Prefix of a legacy (pre-hashing) txindex row.
-constexpr uint8_t DB_TXINDEX{'t'};
+/*
+ * Database layout:
+ *
+ *   ['x', hash prefix, block seq, tx offset] -> (empty)
+ *   ['s', block seq]                         -> block hash
+ *   ['h', block hash]                        -> block seq
+ *   ["next_block_seq"]                       -> next block seq to assign
+ *   ["txid_hash_salt"]                       -> txid hasher salt
+ *   ["best_block_v2"]                        -> current sync locator
+ *   ['t', txid]                              -> legacy CDiskTxPos
+ *   ['B']                                    -> legacy sync locator
+ */
 constexpr uint8_t DB_TXINDEX_HASHED{'x'};
 constexpr uint8_t DB_BLOCK_SEQ{'s'};
 constexpr uint8_t DB_BLOCK_HASH{'h'};
+//! Prefix of a legacy (pre-hashing) txindex row.
+constexpr uint8_t DB_TXINDEX{'t'};
 inline const std::string DB_TXID_HASH_SALT{"txid_hash_salt"};
 inline const std::string DB_NEXT_BLOCK_SEQ{"next_block_seq"};
+inline const std::string DB_BEST_BLOCK_V2{"best_block_v2"};
+
+//! Empty value of a hashed txindex row, whose position is encoded in its key.
+inline constexpr std::array<std::byte, 0> EMPTY_VALUE{};
 
 //! Serialized size of a block header, the offset of the first byte after it.
 constexpr uint32_t BLOCK_HEADER_SIZE{80};
diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
index f829527dcd..fcbc32ab88 100644
--- a/src/test/txindex_tests.cpp
+++ b/src/test/txindex_tests.cpp
@@ -250,7 +250,7 @@ BOOST_FIXTURE_TEST_CASE(txindex_collision_scan_path, TestChain100Setup)
     BOOST_REQUIRE_EQUAL(fake_bucket.size(), 1U);
     const txindex::BlockTxPosition fake_pos{fake_bucket.front()};
 
-    db.Write(txindex::DBKey{target_prefix, fake_pos}, std::array<std::byte, 0>{});
+    db.Write(txindex::DBKey{target_prefix, fake_pos}, txindex::EMPTY_VALUE);
 
     // The target's bucket now holds the real target first (lower sequence
     // number), then the forged false positive, which the descending scan tries first.

</details>

in src/index/txindex_key.h:99 in 5bdfc8017a

  94 | +    std::array<std::byte, sizeof(uint64_t)> be_hash;
  95 | +    WriteBE64(be_hash.data(), hasher.Hash(txid.ToUint256()));
  96 | +    TxHashKeyPrefix prefix;
  97 | +    std::memcpy(prefix.data(), be_hash.data(), prefix.size());
  98 | +    return prefix;
  99 | +}

l0rinc commented at 1:07 AM on July 27, 2026:

5bdfc80 txindex: hash key prefixes and pack block positions:

Now that the size is fixed, I just realized we're reinventing integer serialization here:

constexpr int HASH_PREFIX_SIZE{5};
using TxHashKeyPrefix = uint64_t;

inline TxHashKeyPrefix CreateKeyPrefix(const SipHasher13UJ& hasher, const Txid& txid)
{
    return hasher.Hash(txid.ToUint256()) >> (8 * (sizeof(TxHashKeyPrefix) - HASH_PREFIX_SIZE));
}

struct DBKey {
    TxHashKeyPrefix hash_prefix{0};
    BlockTxPosition pos;

    SERIALIZE_METHODS(DBKey, obj)
    {
        uint8_t prefix{DB_TXINDEX_HASHED};
        READWRITE(prefix);
        if (ser_action.ForRead() && prefix != DB_TXINDEX_HASHED) throw std::ios_base::failure("Invalid format for txindex DB key");
        READWRITE(Using<BigEndianFormatter<HASH_PREFIX_SIZE>>(obj.hash_prefix), obj.pos);
    }
};
} // namespace txindex

It would also simplify it->Seek(std::pair{txindex::DB_TXINDEX_HASHED, prefix}) to it->Seek(key), we wouldn't need to pass the prefix by reference anymore, and the test would look like BOOST_CHECK_EQUAL(HexStr(DataStream{} << txindex::DBKey{0x0102030405, {1, 2}}), "780102030405000001000002").
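
A quick standalone check that the shift-based prefix matches the byte-copy version, as a sketch only: `PrefixFromBytes` mimics the current `WriteBE64` plus `memcpy` approach, `SerializeBE5` is a hypothetical stand-in for what a 5-byte big-endian formatter would write, and the hash value in the usage note is made up.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int HASH_PREFIX_SIZE{5};

// Byte-copy approach: write the 64-bit hash big-endian, keep the first 5 bytes.
std::array<uint8_t, HASH_PREFIX_SIZE> PrefixFromBytes(uint64_t hash)
{
    std::array<uint8_t, 8> be{};
    for (int i = 0; i < 8; ++i) be[i] = static_cast<uint8_t>(hash >> (8 * (7 - i)));
    std::array<uint8_t, HASH_PREFIX_SIZE> prefix{};
    for (int i = 0; i < HASH_PREFIX_SIZE; ++i) prefix[i] = be[i];
    return prefix;
}

// Integer approach: drop the low 3 bytes, keeping the top 5 as a number.
uint64_t PrefixFromShift(uint64_t hash)
{
    return hash >> (8 * (sizeof(uint64_t) - HASH_PREFIX_SIZE));
}

// Hypothetical stand-in for a 5-byte big-endian integer formatter.
std::array<uint8_t, HASH_PREFIX_SIZE> SerializeBE5(uint64_t prefix)
{
    assert(prefix >> (8 * HASH_PREFIX_SIZE) == 0); // must fit in 5 bytes
    std::array<uint8_t, HASH_PREFIX_SIZE> out{};
    for (int i = 0; i < HASH_PREFIX_SIZE; ++i) {
        out[i] = static_cast<uint8_t>(prefix >> (8 * (HASH_PREFIX_SIZE - 1 - i)));
    }
    return out;
}
```

For any hash `h`, `SerializeBE5(PrefixFromShift(h))` should equal `PrefixFromBytes(h)`, so the on-disk key bytes stay the same and only the in-memory type changes.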

<details><summary>serialize txindex prefixes as integers</summary>

diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
index c1c791cb37..1b80d0fea8 100644
--- a/src/index/txindex.cpp
+++ b/src/index/txindex.cpp
@@ -164,7 +164,7 @@ bool TxIndex::FindHashedTx(const Txid& tx_hash, uint256& block_hash, CTransactio
     {
         std::unique_ptr<CDBIterator> it{m_db->NewIterator()};
         const txindex::TxHashKeyPrefix prefix{txindex::CreateKeyPrefix(m_db->m_hasher, tx_hash)};
-        it->Seek(std::pair{txindex::DB_TXINDEX_HASHED, prefix});
+        it->Seek(txindex::DBKey{prefix, {}});
         txindex::DBKey key{prefix, {}};
         for (; it->Valid() && it->GetKey(key) && key.hash_prefix == prefix; it->Next()) {
             uint256 candidate_block_hash;
diff --git a/src/index/txindex_key.h b/src/index/txindex_key.h
index edc47296e2..f91f8ed7dd 100644
--- a/src/index/txindex_key.h
+++ b/src/index/txindex_key.h
@@ -6,16 +6,12 @@
 #define BITCOIN_INDEX_TXINDEX_KEY_H
 
 #include <consensus/consensus.h>
-#include <crypto/common.h>
 #include <crypto/siphash.h>
 #include <primitives/transaction.h>
 #include <serialize.h>
 #include <uint256.h>
 
-#include <array>
-#include <cstddef>
 #include <cstdint>
-#include <cstring>
 #include <ios>
 #include <string>
 #include <utility>
@@ -87,15 +83,12 @@ inline std::pair<uint8_t, uint256> LegacyTxKey(const Txid& txid)
     return {DB_TXINDEX, txid.ToUint256()};
 }
 
-using TxHashKeyPrefix = std::array<std::byte, 5>;
+constexpr int HASH_PREFIX_SIZE{5};
+using TxHashKeyPrefix = uint64_t;
 
 inline TxHashKeyPrefix CreateKeyPrefix(const SipHasher13UJ& hasher, const Txid& txid)
 {
-    std::array<std::byte, sizeof(uint64_t)> be_hash;
-    WriteBE64(be_hash.data(), hasher.Hash(txid.ToUint256()));
-    TxHashKeyPrefix prefix;
-    std::memcpy(prefix.data(), be_hash.data(), prefix.size());
-    return prefix;
+    return hasher.Hash(txid.ToUint256()) >> (8 * (sizeof(TxHashKeyPrefix) - HASH_PREFIX_SIZE));
 }
 
 struct DBKey {
@@ -107,7 +100,7 @@ struct DBKey {
         uint8_t prefix{DB_TXINDEX_HASHED};
         READWRITE(prefix);
         if (ser_action.ForRead() && prefix != DB_TXINDEX_HASHED) throw std::ios_base::failure("Invalid format for txindex DB key");
-        READWRITE(obj.hash_prefix, obj.pos);
+        READWRITE(Using<BigEndianFormatter<HASH_PREFIX_SIZE>>(obj.hash_prefix), obj.pos);
     }
 };
 } // namespace txindex
diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
index 3d5645a508..f829527dcd 100644
--- a/src/test/txindex_tests.cpp
+++ b/src/test/txindex_tests.cpp
@@ -62,12 +62,12 @@ SipHasher13UJ ReadHasher(const CDBWrapper& db)
     return SipHasher13UJ{salt.first, salt.second};
 }
 
-std::vector<txindex::BlockTxPosition> BucketPositions(CDBWrapper& db, const txindex::TxHashKeyPrefix& prefix)
+std::vector<txindex::BlockTxPosition> BucketPositions(CDBWrapper& db, txindex::TxHashKeyPrefix prefix)
 {
     std::vector<txindex::BlockTxPosition> positions;
     std::unique_ptr<CDBIterator> it{db.NewIterator()};
     txindex::DBKey key{prefix, {}};
-    for (it->Seek(std::pair{txindex::DB_TXINDEX_HASHED, prefix}); it->Valid() && it->GetKey(key) && key.hash_prefix == prefix; it->Next()) {
+    for (it->Seek(key); it->Valid() && it->GetKey(key) && key.hash_prefix == prefix; it->Next()) {
         positions.push_back(key.pos);
     }
     return positions;
@@ -120,7 +120,7 @@ BOOST_AUTO_TEST_CASE(txindex_position_encoding)
 
     // Pin the full key encodings, including the type prefixes.
     BOOST_CHECK_EQUAL(HexStr(DataStream{} << txindex::BlockSeqKey{1}), "73000001");
-    BOOST_CHECK_EQUAL(HexStr(DataStream{} << txindex::DBKey{{std::byte{1}, std::byte{2}, std::byte{3}, std::byte{4}, std::byte{5}}, {1, 2}}),
+    BOOST_CHECK_EQUAL(HexStr(DataStream{} << txindex::DBKey{0x0102030405, {1, 2}}),
                       "780102030405000001000002");
 
     BOOST_CHECK_EQUAL(txindex::BLOCK_HEADER_SIZE, GetSerializeSize(CBlockHeader{}));
@@ -129,10 +129,10 @@ BOOST_AUTO_TEST_CASE(txindex_position_encoding)
 BOOST_AUTO_TEST_CASE(txindex_hash_prefix)
 {
     BOOST_CHECK_EQUAL(
-        HexStr(txindex::CreateKeyPrefix(
+        txindex::CreateKeyPrefix(
             SipHasher13UJ{0x0706050403020100ULL, 0x0F0E0D0C0B0A0908ULL},
-            Txid{"1f1e1d1c1b1a191817161514131211100f0e0d0c0b0a09080706050403020100"})),
-        "c67d87b08c");
+            Txid{"1f1e1d1c1b1a191817161514131211100f0e0d0c0b0a09080706050403020100"}),
+        0xc67d87b08cULL);
 }
 
 BOOST_FIXTURE_TEST_CASE(txindex_initial_sync, TestChain100Setup)

</details>

in src/rpc/rawtransaction.cpp:159 in bd2834b47a

 152 | @@ -153,7 +153,7 @@ PartiallySignedTransaction ProcessPSBT(const std::string& psbt_string, const std
 153 |          // Look in the txindex
 154 |          if (g_txindex) {
 155 |              uint256 block_hash;
 156 | -            g_txindex->FindTx(psbt_input.prev_txid, block_hash, tx);
 157 | +            if (!g_txindex->FindTx(psbt_input.prev_txid, block_hash, tx)) tx.reset();
 158 |          }
 159 |          // If we still don't have it look in the mempool
 160 |          if (!tx) {

l0rinc commented at 1:16 AM on July 27, 2026:

bd2834b txindex: make TxIndex::FindTx [[nodiscard]]:

ProcessPSBT initializes tx to null for every input. After a failed txindex lookup it resets that already-null pointer, then checks the pointer again before looking in the mempool.

Since FindTx leaves its output arguments unchanged on failure now, could we use that result directly?

        CTransactionRef tx;
        uint256 block_hash;
        if (!g_txindex || !g_txindex->FindTx(psbt_input.prev_txid, block_hash, tx)) {
            tx = node.mempool->get(psbt_input.prev_txid);
        }
        if (tx) {

andrewtoth commented at 3:18 PM on July 27, 2026:

I think this is out of scope. Let's leave callsites alone for now.

in src/index/txindex.cpp:70 in 7493f5c18c outdated

  71 | +    const SipHasher13UJ m_hasher;
  72 |  
  73 | -    /// Write a batch of transaction positions to the DB.
  74 | -    void WriteTxs(const std::vector<std::pair<Txid, CDiskTxPos>>& v_pos);
  75 | +    /// Whether the database contains any legacy ('t' + txid) entries.
  76 | +    const bool m_has_legacy;

l0rinc commented at 1:23 AM on July 27, 2026:

I wouldn't expect most users to read the release notes and we'd like to nudge users to use this - should we maybe add an innocent loginfo when we have legacy values?

TxIndex::TxIndex(std::unique_ptr<interfaces::Chain> chain, size_t n_cache_size, bool f_memory, bool f_wipe)
    : BaseIndex(std::move(chain), "txindex", "txidx"), m_db(std::make_unique<TxIndex::DB>(n_cache_size, f_memory, f_wipe))
{
    if (m_db->m_has_legacy) {
        LogInfo("txindex contains entries in the pre-hashing format; lookups will check both formats. "
                "To reclaim disk space, stop the node, delete %s and restart to rebuild the index.",
                fs::PathToString(TxIndexDBPath()));
    }
}

andrewtoth commented at 3:18 PM on July 27, 2026:

Done, also added that they should not do this if they are expecting to downgrade.

in src/index/txindex.cpp:135 in 5bdfc8017a

 131 | @@ -104,31 +132,101 @@ bool TxIndex::CustomAppend(const interfaces::BlockInfo& block)
 132 |  
 133 |  BaseIndex::DB& TxIndex::GetDB() const { return *m_db; }
 134 |  
 135 | -bool TxIndex::FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const
 136 | +bool TxIndex::FindHashedTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const

l0rinc commented at 1:54 AM on July 27, 2026:

5bdfc80 txindex: hash key prefixes and pack block positions:

Thanks for cleaning this up, now I can see that we have a fake return value + 2 fake (output) parameters here with a trust-me-bro comment guaranteeing that both remain unchanged on failure. We could modernize/simplify this by returning an optional instead:

struct TxIndexResult {
    uint256 block_hash;
    CTransactionRef tx;
};

like:

-    bool FindHashedTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const;
+    std::optional<TxIndexResult> FindHashedTx(const Txid& tx_hash) const;

and instead of

    if (g_txindex) {
        CTransactionRef tx;
        uint256 block_hash;
        if (g_txindex->FindTx(hash, block_hash, tx)) {
            if (!block_index || block_index->GetBlockHash() == block_hash) {
                // Don't return the transaction if the provided block hash doesn't match.
                // The case where a transaction appears in multiple blocks (e.g. reorgs or
                // BIP30) is handled by the block lookup below.
                hashBlock = block_hash;
                return tx;
            }
        }
    }

we can do:

    auto result{g_txindex ? g_txindex->FindTx(hash) : std::nullopt};
    if (result && (!block_index || block_index->GetBlockHash() == result->block_hash)) {
        // Don't return the transaction if the provided block hash doesn't match.
        // The case where a transaction appears in multiple blocks (e.g. reorgs or
        // BIP30) is handled by the block lookup below.
        hashBlock = result->block_hash;
        return result->tx;
    }
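
The calling pattern can be exercised outside the codebase with a toy index. This is a sketch under stated assumptions: `Tx`, `TxRef`, `ToyTxIndex` and `GetTx` are hypothetical stand-ins using strings for hashes, and only the `TxIndexResult` shape and the `index ? index->FindTx(...) : std::nullopt` idiom are taken from the suggestion above.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <optional>
#include <string>

// Hypothetical stand-ins for CTransaction / CTransactionRef.
struct Tx { std::string txid; };
using TxRef = std::shared_ptr<const Tx>;

struct TxIndexResult {
    std::string block_hash;
    TxRef tx;
};

// Toy in-memory index whose lookup returns an optional instead of filling
// output parameters, so a miss cannot leave half-written outputs behind.
class ToyTxIndex
{
public:
    void Add(const std::string& txid, const std::string& block_hash)
    {
        m_rows[txid] = TxIndexResult{block_hash, std::make_shared<const Tx>(Tx{txid})};
    }

    std::optional<TxIndexResult> FindTx(const std::string& txid) const
    {
        const auto it = m_rows.find(txid);
        if (it == m_rows.end()) return std::nullopt;
        assert(it->second.tx != nullptr);
        return it->second;
    }

private:
    std::map<std::string, TxIndexResult> m_rows;
};

// Caller shaped like the snippet above: accept the result only when no block
// was requested or the requested block matches the indexed one.
TxRef GetTx(const ToyTxIndex* index, const std::string& txid,
            const std::string* expected_block, std::string& hash_block)
{
    auto result{index ? index->FindTx(txid) : std::nullopt};
    if (result && (!expected_block || *expected_block == result->block_hash)) {
        hash_block = result->block_hash;
        return result->tx;
    }
    return nullptr;
}
```

The missing-index case, the miss case and the block-mismatch case all collapse into the same `nullptr` return without touching `hash_block`.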

<details><summary>return txindex lookup result</summary>

diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
index 70b189c42c..677e9a173a 100644
--- a/src/index/txindex.cpp
+++ b/src/index/txindex.cpp
@@ -31,6 +31,7 @@
 #include <cstdio>
 #include <exception>
 #include <functional>
+#include <optional>
 #include <string>
 #include <utility>
 #include <vector>
@@ -152,7 +153,7 @@ bool TxIndex::CustomAppend(const interfaces::BlockInfo& block)
 
 BaseIndex::DB& TxIndex::GetDB() const { return *m_db; }
 
-bool TxIndex::FindHashedTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const
+std::optional<TxIndexResult> TxIndex::FindHashedTx(const Txid& tx_hash) const
 {
     struct Candidate {
         FlatFilePos tx_position;
@@ -203,25 +204,23 @@ bool TxIndex::FindHashedTx(const Txid& tx_hash, uint256& block_hash, CTransactio
             continue;
         }
         if (candidate_tx->GetHash() == tx_hash) {
-            tx = std::move(candidate_tx);
-            block_hash = candidate.block_hash;
-            return true;
+            return TxIndexResult{candidate.block_hash, std::move(candidate_tx)};
         }
     }
-    return false;
+    return std::nullopt;
 }
 
-bool TxIndex::FindLegacyTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const
+std::optional<TxIndexResult> TxIndex::FindLegacyTx(const Txid& tx_hash) const
 {
     CDiskTxPos postx;
     if (!m_db->Read(txindex::LegacyTxKey(tx_hash), postx)) {
-        return false;
+        return std::nullopt;
     }
 
     AutoFile file{m_chainstate->m_blockman.OpenBlockFile(postx, /*fReadOnly=*/true)};
     if (file.IsNull()) {
         LogError("OpenBlockFile failed");
-        return false;
+        return std::nullopt;
     }
     CBlockHeader header;
     CTransactionRef candidate_tx;
@@ -231,22 +230,20 @@ bool TxIndex::FindLegacyTx(const Txid& tx_hash, uint256& block_hash, CTransactio
         file >> TX_WITH_WITNESS(candidate_tx);
     } catch (const std::exception& e) {
         LogError("Deserialize or I/O error - %s", e.what());
-        return false;
+        return std::nullopt;
     }
     if (candidate_tx->GetHash() != tx_hash) {
         LogError("txid mismatch");
-        return false;
+        return std::nullopt;
     }
-    tx = std::move(candidate_tx);
-    block_hash = header.GetHash();
-    return true;
+    return TxIndexResult{header.GetHash(), std::move(candidate_tx)};
 }
 
-bool TxIndex::FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const
+std::optional<TxIndexResult> TxIndex::FindTx(const Txid& tx_hash) const
 {
-    if (FindHashedTx(tx_hash, block_hash, tx)) return true;
+    if (auto result{FindHashedTx(tx_hash)}) return result;
 
     // Fall back to legacy if no hashed entry matched. This makes misses pay an
     // extra lookup, but keeps existing full-txid entries readable after upgrade.
-    return m_db->m_has_legacy && FindLegacyTx(tx_hash, block_hash, tx);
+    return m_db->m_has_legacy ? FindLegacyTx(tx_hash) : std::nullopt;
 }
diff --git a/src/index/txindex.h b/src/index/txindex.h
index 5f866d6ac7..c25f188555 100644
--- a/src/index/txindex.h
+++ b/src/index/txindex.h
@@ -7,11 +7,12 @@
 
 #include <index/base.h>
 #include <primitives/transaction.h>
+#include <uint256.h>
 
 #include <cstddef>
 #include <memory>
+#include <optional>
 
-class uint256;
 namespace interfaces {
 class Chain;
 }
@@ -21,6 +22,11 @@ class TxIndexTest;
 
 static constexpr bool DEFAULT_TXINDEX{false};
 
+struct TxIndexResult {
+    uint256 block_hash;
+    CTransactionRef tx;
+};
+
 /**
  * TxIndex is used to look up transactions included in the blockchain by hash.
  * The index is written to a LevelDB database and records the block sequence
@@ -38,10 +44,10 @@ private:
     bool AllowPrune() const override { return false; }
 
     /// Look up a transaction in the hashed-prefix index.
-    bool FindHashedTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const;
+    std::optional<TxIndexResult> FindHashedTx(const Txid& tx_hash) const;
 
     /// Look up a transaction among the legacy (full-txid) entries.
-    bool FindLegacyTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const;
+    std::optional<TxIndexResult> FindLegacyTx(const Txid& tx_hash) const;
 
 protected:
     bool CustomAppend(const interfaces::BlockInfo& block) override;
@@ -58,10 +64,8 @@ public:
     /// Look up a transaction by hash.
     ///
     /// @param[in]   tx_hash  The hash of the transaction to be returned.
-    /// @param[out]  block_hash  The hash of the block the transaction is found in. Unchanged if false is returned.
-    /// @param[out]  tx  The transaction itself. Unchanged if false is returned.
-    /// @return  true if transaction is found, false otherwise
-    [[nodiscard]] bool FindTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const;
+    /// @return  The transaction and containing block hash, or nullopt if it is not found.
+    [[nodiscard]] std::optional<TxIndexResult> FindTx(const Txid& tx_hash) const;
 };
 
 /// The global transaction index, used in GetTransaction. May be null.
diff --git a/src/node/transaction.cpp b/src/node/transaction.cpp
index d331ae05aa..d4355b9596 100644
--- a/src/node/transaction.cpp
+++ b/src/node/transaction.cpp
@@ -15,6 +15,8 @@
 #include <validationinterface.h>
 #include <node/transaction.h>
 
+#include <optional>
+
 namespace node {
 static TransactionError HandleATMPError(const TxValidationState& state, std::string& err_string_out)
 {
@@ -145,18 +147,13 @@ CTransactionRef GetTransaction(const CBlockIndex* const block_index, const CTxMe
         CTransactionRef ptx = mempool->get(hash);
         if (ptx) return ptx;
     }
-    if (g_txindex) {
-        CTransactionRef tx;
-        uint256 block_hash;
-        if (g_txindex->FindTx(hash, block_hash, tx)) {
-            if (!block_index || block_index->GetBlockHash() == block_hash) {
-                // Don't return the transaction if the provided block hash doesn't match.
-                // The case where a transaction appears in multiple blocks (e.g. reorgs or
-                // BIP30) is handled by the block lookup below.
-                hashBlock = block_hash;
-                return tx;
-            }
-        }
+    auto result{g_txindex ? g_txindex->FindTx(hash) : std::nullopt};
+    if (result && (!block_index || block_index->GetBlockHash() == result->block_hash)) {
+        // Don't return the transaction if the provided block hash doesn't match.
+        // The case where a transaction appears in multiple blocks (e.g. reorgs or
+        // BIP30) is handled by the block lookup below.
+        hashBlock = result->block_hash;
+        return result->tx;
     }
     if (block_index) {
         CBlock block;
diff --git a/src/rpc/rawtransaction.cpp b/src/rpc/rawtransaction.cpp
index c50f1135de..29cc39c10e 100644
--- a/src/rpc/rawtransaction.cpp
+++ b/src/rpc/rawtransaction.cpp
@@ -148,11 +148,8 @@ PartiallySignedTransaction ProcessPSBT(const std::string& psbt_string, const std
         // The `non_witness_utxo` is the whole previous transaction
         if (psbt_input.non_witness_utxo) continue;
 
-        CTransactionRef tx;
-        uint256 block_hash;
-        if (!g_txindex || !g_txindex->FindTx(psbt_input.prev_txid, block_hash, tx)) {
-            tx = node.mempool->get(psbt_input.prev_txid);
-        }
+        const auto txindex_result{g_txindex ? g_txindex->FindTx(psbt_input.prev_txid) : std::nullopt};
+        CTransactionRef tx{txindex_result ? txindex_result->tx : node.mempool->get(psbt_input.prev_txid)};
         if (tx) {
             psbt_input.non_witness_utxo = tx;
         } else {
diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
index fcbc32ab88..9fbfe19825 100644
--- a/src/test/txindex_tests.cpp
+++ b/src/test/txindex_tests.cpp
@@ -84,11 +84,9 @@ FlatFilePos BlockFilePos(const ChainstateManager& chainman, uint32_t height)
 //! Look up a transaction, requiring success, and return the containing block's hash.
 uint256 LookupTx(const TxIndex& txindex, const Txid& txid)
 {
-    CTransactionRef tx;
-    uint256 block_hash;
-    BOOST_REQUIRE(txindex.FindTx(txid, block_hash, tx));
-    BOOST_CHECK(Assert(tx)->GetHash() == txid);
-    return block_hash;
+    const auto result{txindex.FindTx(txid)};
+    BOOST_CHECK(Assert(result)->tx->GetHash() == txid);
+    return result->block_hash;
 }
 
 void InvalidateBlock(ChainstateManager& chainman, const uint256& block_hash)
@@ -140,12 +138,9 @@ BOOST_FIXTURE_TEST_CASE(txindex_initial_sync, TestChain100Setup)
     TxIndex txindex(interfaces::MakeChain(m_node), /*n_cache_size=*/1_MiB, /*f_memory=*/true);
     BOOST_REQUIRE(txindex.Init());
 
-    CTransactionRef tx_disk;
-    uint256 block_hash;
-
     // Transaction should not be found in the index before it is started.
     for (const auto& txn : m_coinbase_txns) {
-        BOOST_CHECK(!txindex.FindTx(txn->GetHash(), block_hash, tx_disk));
+        BOOST_CHECK(!txindex.FindTx(txn->GetHash()));
     }
 
     // BlockUntilSyncedToCurrentChain should return false before txindex is started.
@@ -156,7 +151,7 @@ BOOST_FIXTURE_TEST_CASE(txindex_initial_sync, TestChain100Setup)
     // Check that txindex excludes genesis block transactions.
     const CBlock& genesis_block = Params().GenesisBlock();
     for (const auto& txn : genesis_block.vtx) {
-        BOOST_CHECK(!txindex.FindTx(txn->GetHash(), block_hash, tx_disk));
+        BOOST_CHECK(!txindex.FindTx(txn->GetHash()));
     }
 
     // Check that txindex has all txs that were in the chain before it started.
@@ -270,9 +265,7 @@ BOOST_FIXTURE_TEST_CASE(txindex_collision_scan_path, TestChain100Setup)
     const CDiskTxPos fake_physical{BlockFilePos(*m_node.chainman, fake_pos.block_seq + 1), fake_pos.tx_offset_in_block - txindex::BLOCK_HEADER_SIZE};
     db.Erase(txindex::DBKey{fake_prefix, fake_pos});
     db.Write(txindex::LegacyTxKey(fake_txid), fake_physical);
-    CTransactionRef legacy_tx;
-    uint256 block_hash;
-    BOOST_CHECK(!txindex.FindTx(fake_txid, block_hash, legacy_tx));
+    BOOST_CHECK(!txindex.FindTx(fake_txid));
 
     txindex.Stop();
 }

</details>

andrewtoth commented at 3:20 PM on July 27, 2026:

Updated the internal methods to do this, but declined to change callsites for this PR. I think in the future when we enable pruning, we can return an expected with the tx and blockhash, and unexpected will be the pruned block hashes if any were found.

in src/index/txindex.cpp:163 in 5bdfc8017a outdated

 159 | +                LogError("Block index entry %s not found for txid %s", candidate_block_hash.ToString(), tx_hash.ToString());
 160 | +                continue;
 161 | +            }
 162 | +            if (!(block_index->nStatus & BLOCK_HAVE_DATA)) continue;
 163 | +            const FlatFilePos tx_position{block_index->nFile, block_index->nDataPos + key.pos.tx_offset_in_block};
 164 | +            candidates.emplace_back(tx_position, candidate_block_hash, key.pos.block_seq, m_chainstate->m_chain.Contains(*block_index));

l0rinc commented at 2:14 AM on July 27, 2026:

5bdfc80 txindex: hash key prefixes and pack block positions:

We're still iterating eagerly, can you please give me a hint why you think that's better? See my previous comments here #35531 (review).

andrewtoth commented at 11:58 AM on July 27, 2026:

This lets us collapse the loops into a single one. I think it makes it easier to read. We already have the iterator created and only expect at most a few values anyways. I don't think we gain anything by not iterating all values. The lookup time is dominated by reading and deserializing the txs anyways.

l0rinc commented at 6:23 PM on July 27, 2026:

In that case can we extract the collection so that the locking on main is minimal?

andrewtoth commented at 9:20 PM on July 27, 2026:

Is the code complexity tradeoff worth it for this? The vast majority of time we will only have 1 row to lookup, so the lock will be taken once. Otherwise it's just a release and take of the lock. I don't think taking the lock 2 or 3 times will have a noticeable effect for users.

l0rinc commented at 9:40 PM on July 27, 2026:

Is the code complexity tradeoff worth it for this

Probably not, I will benchmark it with real data to see.

in src/index/txindex_key.h:35 in 5bdfc8017a outdated

  30 | +inline const std::string DB_NEXT_BLOCK_SEQ{"next_block_seq"};
  31 | +
  32 | +//! Serialized size of a block header, the offset of the first byte after it.
  33 | +constexpr uint32_t BLOCK_HEADER_SIZE{80};
  34 | +
  35 | +//! The location of a transaction: the sequence number of the block that contains it

l0rinc commented at 2:24 AM on July 27, 2026:

5bdfc80 txindex: hash key prefixes and pack block positions:

a block reorged out then reorged back would get duplicate entries added

I have a few objections here.

First, rebuilding txindex discards the old stale-block entries because we're only iterating the active chain, so users should probably be aware that they will lose the stale blocks. This is also the current behavior on reindex. Also, during IBD we won't have reorgs, so there's no point in checking for duplicates. Which begs the question: why are we complicating this with extra lookups just to store barely any of these in steady state (and whether the bloom filters are still not needed in the latest version).

Second, sequence numbers only start diverging from heights after IBD, on the first actual steady-state reorg, so we don't need to check for duplicates at all until the sequence is height - 1.

<details><summary>avoid txindex rebuild reads</summary>

diff --git a/src/index/txindex.cpp b/src/index/txindex.cpp
index 01d0e7dbfe..e0cdd81222 100644
--- a/src/index/txindex.cpp
+++ b/src/index/txindex.cpp
@@ -69,6 +69,9 @@ public:
     /// Whether the database contains any legacy ('t' + txid) entries.
     const bool m_has_legacy;

+    /// Sequence number to assign to the next newly indexed block.
+    uint32_t m_next_block_seq{0};
+
     CBlockLocator ReadBestBlock() const override;
     void WriteBestBlock(CDBBatch& batch, const CBlockLocator& locator) override;

@@ -88,7 +91,9 @@ TxIndex::DB::DB(size_t n_cache_size, bool f_memory, bool f_wipe, bool has_legacy
     BaseIndex::DB(TxIndexDBPath(), n_cache_size, f_memory, f_wipe, /*f_obfuscate=*/false, /*f_bloom=*/has_legacy),
     m_hasher{ReadOrCreateTxidHasher(*this)},
     m_has_legacy{has_legacy}
-{}
+{
+    Read(txindex::DB_NEXT_BLOCK_SEQ, m_next_block_seq);
+}

 CBlockLocator TxIndex::DB::ReadBestBlock() const
 {
@@ -107,11 +112,9 @@ void TxIndex::DB::WriteBestBlock(CDBBatch& batch, const CBlockLocator& locator)

 void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block)
 {
-    // A block that reconnects after a reorg keeps its original sequence number.
-    if (Exists(txindex::BlockHashKey{block.hash})) return;
-
-    uint32_t block_seq{0};
-    Read(txindex::DB_NEXT_BLOCK_SEQ, block_seq);
+    // A fresh rebuild assigns sequence height - 1, so only check for duplicates after they diverge.
+    auto block_seq{m_next_block_seq};
+    if (block_seq != uint32_t(block.height - 1) && Exists(txindex::BlockHashKey{block.hash})) return;

     CDBBatch batch(*this);
     batch.Write(txindex::BlockHashKey{block.hash}, block_seq);
@@ -127,6 +130,7 @@ void TxIndex::DB::WriteTxs(const interfaces::BlockInfo& block)
         tx_offset_in_block += tx->ComputeTotalSize();
     }
     WriteBatch(batch);
+    m_next_block_seq = block_seq + 1;
 }

 TxIndex::TxIndex(std::unique_ptr<interfaces::Chain> chain, size_t n_cache_size, bool f_memory, bool f_wipe)
diff --git a/src/test/txindex_tests.cpp b/src/test/txindex_tests.cpp
index ce375fc50d..3d5645a508 100644
--- a/src/test/txindex_tests.cpp
+++ b/src/test/txindex_tests.cpp
@@ -201,6 +201,28 @@ BOOST_FIXTURE_TEST_CASE(txindex_locator_upgrade, TestChain100Setup)
     BOOST_CHECK(stored_legacy_locator.vHave == legacy_locator.vHave);
 }

+BOOST_FIXTURE_TEST_CASE(txindex_sequence_restart, TestChain100Setup)
+{
+    uint32_t expected_seq;
+    {
+        TxIndex txindex(interfaces::MakeChain(m_node), /*n_cache_size=*/1_MiB, /*f_memory=*/false);
+        BOOST_REQUIRE(txindex.Init());
+        txindex.Sync();
+        BOOST_REQUIRE(TxIndexTest::GetDB(txindex).Read(txindex::DB_NEXT_BLOCK_SEQ, expected_seq));
+        txindex.Stop();
+    }
+
+    const CBlock block{CreateAndProcessBlock({}, CScript() << OP_TRUE)};
+    TxIndex txindex(interfaces::MakeChain(m_node), /*n_cache_size=*/1_MiB, /*f_memory=*/false);
+    BOOST_REQUIRE(txindex.Init());
+    txindex.Sync();
+
+    uint32_t block_seq;
+    BOOST_REQUIRE(TxIndexTest::GetDB(txindex).Read(txindex::BlockHashKey{block.GetHash()}, block_seq));
+    BOOST_CHECK_EQUAL(block_seq, expected_seq);
+    txindex.Stop();
+}
+
 BOOST_FIXTURE_TEST_CASE(txindex_collision_scan_path, TestChain100Setup)
 {
     // On-disk, so the legacy-entry probe at construction runs against a fresh

</details>

andrewtoth commented at 12:03 PM on July 27, 2026:

For the first objection, this is so we keep the existing behavior exactly. The barely any duplicates out of the total amount is not really the point, it's that transactions that exist in reorged blocks can be looked up without the block hash.

For the second objection, this doesn't really matter. A single lookup for an entire block is negligible compared to the cost of computing and writing tens of thousands of txid entries.

andrewtoth commented at 3:21 PM on July 27, 2026:

I made a simple change to read the sequence first, before doing the existence check. That way we can skip the existence check until the first reorg.

in src/index/txindex.h:1 in bd2834b47a outdated

l0rinc commented at 2:25 AM on July 27, 2026:

When testing on mainnet, I got ~860k 2-way collisions, and 1 3-way collision that worst case could cause an extra 2 false positives when reading.

The reported collision counts don’t seem internally consistent. With about 1.376 billion txids and 2^40 buckets, the expected number of colliding pairs is about 861k, which matches the measurement, but the expected number of buckets containing three txids should be about 359 according to the AI overlords.
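As a rough cross-check of these figures, assume the salted 5-byte prefixes are uniform over $m = 2^{40}$ buckets and take $n \approx 1.376 \times 10^{9}$ txids (both values from this comment). The usual birthday-style estimates then give:

```latex
\begin{align*}
\mathbb{E}[\text{colliding pairs}]
  &= \binom{n}{2}\frac{1}{m} \approx \frac{n^{2}}{2m}
   = \frac{(1.376\times 10^{9})^{2}}{2\cdot 2^{40}} \approx 8.61\times 10^{5} \\
\mathbb{E}[\text{colliding triples}]
  &= \binom{n}{3}\frac{1}{m^{2}} \approx \frac{n^{3}}{6m^{2}}
   = \frac{(1.376\times 10^{9})^{3}}{6\cdot 2^{80}} \approx 3.59\times 10^{2}
\end{align*}
```

Under that uniformity assumption, a single 3-way collision across the whole index would be far below the expected count.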

andrewtoth commented at 3:21 PM on July 27, 2026:

Maybe I got lucky that time? I can remeasure another run with different salt values.

andrewtoth commented at 11:15 PM on July 29, 2026:

Remeasured on a new run.

1,402,773,589 prefixes with a single row (99.9362%)
894,549 duplicates (2-row buckets)
395 triplicates (3-row buckets)
1 four-way collision (4-row buckets)

l0rinc changes_requested

l0rinc commented at 2:50 AM on July 27, 2026: contributor

Approach looks good overall, but I’d still like us to avoid unnecessary rebuild reads, reconsider bloom filters, test and document downgrades, prevent the prefix probe’s filesystem side effects, and clarify the unexpectedly low three-way collision count.

andrewtoth force-pushed on Jul 27, 2026

DrahtBot added the label CI failed on Jul 27, 2026

DrahtBot commented at 3:34 PM on July 27, 2026: contributor

🚧 At least one of the CI tasks failed.
Task iwyu: https://github.com/bitcoin/bitcoin/actions/runs/30279068381/job/90020500246
LLM reason (✨ experimental): CI failed because IWYU detected missing/unneeded includes and forced a failure (Fixing #includes in src/index/txindex.cpp → Failure generated from IWYU).

<details><summary>Hints</summary>

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

</details>

DrahtBot removed the label CI failed on Jul 27, 2026

andrewtoth commented at 5:14 PM on July 27, 2026: contributor

Thanks again @l0rinc. Addressed your suggestions.

in src/dbwrapper.h:284 in cba3441226

 278 | @@ -279,6 +279,11 @@ class CDBWrapper
 279 |       */
 280 |      bool IsEmpty();
 281 |  
 282 | +    //! Probe an unopened database for a key prefix. Return true if a database at
 283 | +    //! path exists and contains at least 1 entry beginning with prefix; missing,
 284 | +    //! unopenable, or empty databases return false.

l0rinc commented at 6:51 PM on July 27, 2026:

cba3441 txindex: skip bloom filters and legacy lookups for new databases:

Unopenable throws now:

    //! path exists and contains at least 1 entry beginning with prefix; missing
    //! or empty databases return false, and database errors throw dbwrapper_error.

l0rinc approved

l0rinc commented at 6:53 PM on July 27, 2026: contributor

LGTM, I'm measuring a few more scenarios as a last step, left some nits, I like how simple this version is now.

arejula27 commented at 7:56 PM on July 27, 2026: none

@optout21 asked upthread whether there is a benchmark that exposes the effect of this change. I have been adding index benchmarks (#35827) and ran this PR against its base with them.

Index on disk: −49% (114,378 -> 58,177 bytes), in line with the 66 -> 26 GB in the description.

Sync: +2.7% to +4.9% instructions, which is the SipHash.

Lookups, timing FindTx directly:

| chain | keys | ins/lookup (base → PR) | Δ |
|---|---|---|---|
| 50 blocks × 50 txs | ~2.5k | 33,305 → 37,415 | +12.3% |
| 200 blocks × 100 txs | ~20k | 45,083 → 83,805 | +85.9% |

The regression scales with the size of the index, so a small one underestimates it. The ~0.2ms getrawtransaction figure in the description goes through the RPC and the block-file read, which dominate at that scale and would hide this.

andrewtoth commented at 9:17 PM on July 27, 2026: contributor

@arejula27 thanks for this benchmark data. Is the third column supposed to be ns/lookup, so a lookup is 0.08 ms? Does this continue to scale for even larger databases 10s or 100s of MB in size?

A seek lookup is likely slower than a point read, but in this case the lookup time also needs to account for reading the tx from the block file. A larger tx of course will take longer to read and will dominate the lookup time.

I think a slight read regression is well worth the tradeoff for the reduction in disk size, sync speedup, and possibility for using in pruned mode. This is also the same pattern used in txospenderindex, which is an even larger database and shows that this lookup pattern is fast enough.
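To make the seek-versus-point-read distinction concrete, here is a minimal sketch of the lookup pattern using an ordered `std::set<std::string>` as a stand-in for LevelDB. Everything in it is hypothetical and for illustration only: the names `ToyTxPos`, `MakeKey` and `ScanPrefix`, the big-endian byte order, and the exact key layout (5-byte prefix, then 3-byte sequence, then 3-byte offset) are assumptions, not the PR's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these names come from the PR.
constexpr std::size_t PREFIX_SIZE{5};

struct ToyTxPos {
    uint32_t block_seq; // assumed to fit in 3 bytes
    uint32_t tx_offset; // assumed to fit in 3 bytes
};

// Append the low 3 bytes of v, most significant byte first (assumed byte order).
void AppendBE24(std::string& out, uint32_t v)
{
    assert(v < (1u << 24));
    out.push_back(static_cast<char>((v >> 16) & 0xff));
    out.push_back(static_cast<char>((v >> 8) & 0xff));
    out.push_back(static_cast<char>(v & 0xff));
}

uint32_t ReadBE24(const std::string& s, std::size_t at)
{
    return (uint32_t{static_cast<uint8_t>(s[at])} << 16) |
           (uint32_t{static_cast<uint8_t>(s[at + 1])} << 8) |
           uint32_t{static_cast<uint8_t>(s[at + 2])};
}

// Key = hashed prefix followed by the packed position; there is no separate value.
std::string MakeKey(const std::string& prefix, ToyTxPos pos)
{
    assert(prefix.size() == PREFIX_SIZE);
    std::string key{prefix};
    AppendBE24(key, pos.block_seq);
    AppendBE24(key, pos.tx_offset);
    return key;
}

// Seek to the first key >= prefix, then walk forward while the prefix still matches.
// Each returned position is only a candidate: the caller still has to read the
// transaction and compare its full txid.
std::vector<ToyTxPos> ScanPrefix(const std::set<std::string>& db, const std::string& prefix)
{
    std::vector<ToyTxPos> candidates;
    for (auto it{db.lower_bound(prefix)};
         it != db.end() && it->compare(0, PREFIX_SIZE, prefix) == 0; ++it) {
        candidates.push_back({ReadBE24(*it, PREFIX_SIZE), ReadBE24(*it, PREFIX_SIZE + 3)});
    }
    return candidates;
}
```

A point read would need the full key up front; here only the prefix is known, so the lookup pays for one seek plus one step per colliding row.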

in src/index/txindex.cpp:202 in cba3441226 outdated

 198 | @@ -169,58 +199,58 @@ bool TxIndex::FindHashedTx(const Txid& tx_hash, uint256& block_hash, CTransactio
 199 |              LogError("OpenBlockFile failed for txid %s", tx_hash.ToString());
 200 |              continue;
 201 |          }
 202 | -        CTransactionRef candidate_tx;
 203 | +        CTransactionRef tx;

l0rinc commented at 6:34 AM on July 28, 2026:

cba3441 txindex: skip bloom filters and legacy lookups for new databases:

There's still quite a lot of changing back and forth; for future reviewers I think we should minimize churn.

in src/index/txindex.cpp:216 in cba3441226 outdated

 219 | -    return false;
 220 | +    return std::nullopt;
 221 |  }
 222 |  
 223 | -bool TxIndex::FindLegacyTx(const Txid& tx_hash, uint256& block_hash, CTransactionRef& tx) const
 224 | +std::optional<TxIndex::TxIndexResult> TxIndex::FindLegacyTx(const Txid& tx_hash) const

l0rinc commented at 6:35 AM on July 28, 2026:

cba3441 txindex: skip bloom filters and legacy lookups for new databases:

same here, this could have been introduced at the split

l0rinc approved

l0rinc commented at 6:40 AM on July 28, 2026: contributor

I benchmarked this version against a fully synced mainnet datadir at height 958,859 on a Raspberry Pi 5 (4 Cortex-A76 cores, aarch64, ext4), see https://github.com/l0rinc/bitcoin/pull/244/commits

Index rebuild: A fresh rebuild produced the following size and wall-time results:

Size
before  ████████████████████████████  66.99 GiB
after   ██████████▓░░░░░░░░░░░░░░░░░  26.04 GiB  (-61.1%)

Rebuild wall time
before  ████████████████████████████  3h 26m 35s
after   ███████████████████████▒░░░░  2h 53m 03s  (-16.2%, 1.19x)

The rebuild speedup is lower than the 1.60x I measured earlier on a Ryzen with an earlier revision (I'll redo that later), while the disk reduction remained consistent, so rebuild timing appears to be more platform- and revision-dependent.

Lookups: Each workload contains 90,545 RPC calls. Random hits sample active-chain transactions across the block files, misses use derived nonexistent txids, and repeated hits query the same txid throughout. The chart shows the mean of 10 runs after one warmup:

Mean wall time per lookup
random hits               before  ████████████████████████████  1.561 ms
                          after   █████████████████▒░░░░░░░░░░  0.979 ms  (-37.3%)

misses                    before  ████████████████▒░░░░░░░░░░░  0.929 ms
                          after   ██████████▒░░░░░░░░░░░░░░░░░  0.578 ms  (-37.8%)

repeated hit (same txid)  before  ███████▒░░░░░░░░░░░░░░░░░░░░  0.415 ms
                          after   ███████▒░░░░░░░░░░░░░░░░░░░░  0.415 ms  (~0%)

Prefix collisions: I scanned the full rebuilt index offline:

1,400,142,953 populated 5-byte prefixes
├── 1 row   1,399,250,896  99.936288%
├── 2 rows        891,681   0.063685%   (~891,489 expected)
└── 3 rows            376   0.0000269%  (~379 expected)

largest bucket: 3 rows
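The "expected" figures in the scan above can be approximated with a Poisson occupancy model. The sketch below assumes the prefixes are uniform over 2^40 buckets and derives the total row count from the scan itself; the function names are hypothetical and this is not necessarily how the quoted expectations were produced.

```cpp
#include <cassert>
#include <cmath>

// Expected number of buckets holding exactly k rows when `rows` items are spread
// uniformly over `buckets` slots, using the Poisson approximation with
// lambda = rows / buckets.
double ExpectedBucketsWithRows(double rows, double buckets, unsigned k)
{
    assert(buckets > 0);
    const double lambda{rows / buckets};
    double term{std::exp(-lambda)};                      // P(bucket has 0 rows)
    for (unsigned i{1}; i <= k; ++i) term *= lambda / i; // P(bucket has i rows)
    return buckets * term;
}

// Total rows implied by the scan: buckets with 1, 2 and 3 rows.
double RowsFromScan()
{
    return 1399250896.0 + 2 * 891681.0 + 3 * 376.0;
}
```

Calling `ExpectedBucketsWithRows(RowsFromScan(), std::ldexp(1.0, 40), k)` for `k = 2` and `k = 3` should land close to the expected values listed in the scan.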

Overall, the size and random-lookup improvements are substantial, measured prefix-collision overhead is negligible, and only the rebuild speedup varied materially from my earlier result.

ACK 40c4684bda3950a2850f1988e76c2461067d403e

DrahtBot requested review from optout21 on Jul 28, 2026

DrahtBot requested review from theStack on Jul 28, 2026

DrahtBot requested review from sedited on Jul 28, 2026

arejula27 commented at 5:00 PM on July 28, 2026: none

Concept ACK

Is the third column supposed to be ns/lookup, so a lookup is 0.08 ms? Does this continue to scale for even larger databases 10s or 100s of MB in size?

No, ins/lookup means instructions, not time. It was the bigger difference, so I thought it was interesting; however, now I think it might not be as relevant as I thought. Sorry for the inconvenience. I also re-ran the experiment to give you the times:

| chain | keys | ns/lookup (base → PR) | Δ (ns) | ins/lookup (base → PR) | Δ (ins) |
|---|---|---|---|---|---|
| 50×50 | 2.5k | 6,800 → 6,763 | −0.5% | 33,303 → 37,349 | +12.1% |
| 200×100 | 20k | 6,696 → 9,507 | +42.0% | 45,019 → 83,881 | +86.4% |

At 2.5k keys it's a wash in time despite +12% instructions; the cost only shows up in the 20k scenario. So it grows with the database.

That said, my two data points are both regtest indexes that fit entirely in the block cache, so what they isolate is the CPU cost of the seek with no I/O involved. For a real scenario, @l0rinc's mainnet numbers (review) are the ones to look at, and they go the other way (though I think those are full-stack benchmarks over RPC, not just the FindTx call). So I'd read my +42% as the isolated seek overhead, not as a prediction for a real index. In any case, I wouldn't consider my experiment a blocker for this PR.

A seek lookup is likely slower than a point read, but in this case the lookup time also needs to account for reading the tx from the block file.

Agreed. This bench just profiles the CPU since, as I mentioned before, everything is cached in memory. I thought it might be interesting to see, but I might be wrong.

I think a slight read regression is well worth the tradeoff

Agreed, I think the tradeoff is worth it; this "issue" is just noise, as demonstrated upthread.

I am doing a deeper review of the code; so far I have not found anything relevant to say. Will post again when finished.

in src/index/txindex.cpp:96 in 019e173194

  92 | +    uint32_t block_seq{0};
  93 | +    Read(txindex::DB_NEXT_BLOCK_SEQ, block_seq);
  94 | +
  95 | +    // A block that reconnects after a reorg keeps its original sequence number.
  96 | +    // No reorgs have occurred if sequence == block.height - 1, so we can skip the check.
  97 | +    if (block_seq != uint32_t(block.height - 1) && Exists(txindex::BlockHashKey{block.hash})) return;

mzumsande commented at 7:46 PM on July 28, 2026:

This is not only important in case of a reorg, but also in case of an unclean shutdown; maybe this should be mentioned too. I see that skipping is ok: even if the blocks are being re-downloaded and duplicates are written to the blk files, the index entries with their relative positions should still be correct, so no re-processing is necessary.

But the height-based skip looks a bit fragile to me, since the legacy index functionality can lead to the sequence (that counts the number of blocks indexed with v2) being smaller than the total number of blocks indexed. E.g. this slightly contrived scenario: start with a legacy index from genesis, mine block 1, restart with V2, mine another block (seq 0 -> 1), unclean crash, restart, get block 2 again (e.g. from peers), and now Exists() will be skipped even if it shouldn't be, leading to the same block being indexed twice (with seq 0 and 1).

It's a bit hard to find things in the history, but is the reason for it pure optimisation, saving an Exists() call, or is there a functional reason too?
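The contrived scenario above can be replayed with a toy model of the sequence bookkeeping. This is only a sketch under stated assumptions: `ToyIndex` and `ReplayCrashScenario` are hypothetical names, an in-memory map stands in for the database, and the skip condition is copied from the hunk under review rather than from the final code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for the block-sequence handling in WriteTxs (hypothetical).
struct ToyIndex {
    const bool height_shortcut; // true: skip the existence check when seq == height - 1
    uint32_t next_block_seq{0};
    std::map<std::string, uint32_t> seq_by_block_hash;
    std::vector<std::pair<uint32_t, std::string>> indexed; // one entry per indexing pass

    explicit ToyIndex(bool shortcut) : height_shortcut{shortcut} {}

    void WriteTxs(const std::string& block_hash, int height)
    {
        assert(height >= 1);
        const uint32_t block_seq{next_block_seq};
        // Without the shortcut the existence check always runs.
        const bool must_check{!height_shortcut || block_seq != uint32_t(height - 1)};
        if (must_check && seq_by_block_hash.count(block_hash) > 0) return;

        seq_by_block_hash[block_hash] = block_seq;
        indexed.emplace_back(block_seq, block_hash);
        next_block_seq = block_seq + 1;
    }
};

// Block 1 was indexed by the legacy code, so it consumed no sequence number.
// Block 2 is indexed with seq 0, then delivered again after an unclean shutdown.
std::size_t ReplayCrashScenario(bool height_shortcut)
{
    ToyIndex index{height_shortcut};
    index.WriteTxs("block2", /*height=*/2); // seq 0: 0 != 1, checked, not present, written
    index.WriteTxs("block2", /*height=*/2); // seq 1: 1 == 1, so the shortcut skips the check
    return index.indexed.size();
}
```

In this model `ReplayCrashScenario(true)` indexes the block twice (sequences 0 and 1), while `ReplayCrashScenario(false)` indexes it once, which is the asymmetry the comment describes.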

l0rinc commented at 8:18 PM on July 28, 2026:

It's a bit hard to find things in the history, but is the reason for it pure optimisation

Please see #35531 (review)

If people switch back and forth I don't think we should guarantee anything (especially not performance); that's why I suggested deleting the index before upgrading or downgrading. But reducing the leveldb reads would definitely be preferable. I'm currently benchmarking caching the next block sequence and the block sequence mappings to see how much it would speed up recreation and lookups.

andrewtoth commented at 11:05 PM on July 29, 2026:

The reason was pure optimization. I reverted to the previous version. I don't think it will have a noticeable performance cost. A single read for an entire block should be negligible.

in doc/release-notes-35531.md:3 in 40c4684bda outdated

   0 | @@ -0,0 +1,9 @@
   1 | +## Index
   2 | +
   3 | +- The transaction index (`-txindex`) now stores less data on disk; a fully

mzumsande commented at 8:11 PM on July 28, 2026:

nit: drop "stale block lookups" from commit msg

l0rinc commented at 8:19 PM on July 28, 2026:

with sequences we still have stale block lookups, but only in steady-state reorgs: recreating the index won't iterate stale blocks so if you had it before the reindex will get rid of them.

txindex: hash key prefixes and pack block positions

Use a 5-byte salted siphash to key txindex entries,
instead of the full 32-byte txid. Store the block sequence and tx position
after the hash in the key, so an iterator can scan
through any collisions and return the correct tx.

Fall back to the legacy key lookup if the tx is not found.

Co-authored-by: Pieter Wuille <pieter@wuille.net>
Co-authored-by: l0rinc <pap.lorinc@gmail.com>
Co-authored-by: Anthony Towns <aj@erisian.com.au>

d2c6cd6482

txindex: skip bloom filters and legacy lookups for new databases

New databases never contain legacy ('t' + txid) entries.
Peek at the database before opening it, and if no legacy
entries are found, skip building bloom filters (hashed
entries are only read via iterators, which do not consult
them) and return early from lookups instead of checking
for legacy entries.

9123dea7a8

tests: cover txindex hash prefix collisions and legacy fallback

Co-authored-by: l0rinc <pap.lorinc@gmail.com>

88c27410b9

doc: add release notes for txindex disk usage and downgrading 591abd58f8

andrewtoth force-pushed on Jul 29, 2026

andrewtoth commented at 11:08 PM on July 29, 2026: contributor

Thanks for the reviews @l0rinc @arejula27 @mzumsande.

Addressed nits regarding commit change ordering and commit messaging, reverted skipping the sequence check when indexing a new block, and updated some code comments for clarity.