Add an address-based transaction index

sipa commented at 10:50 pm on June 27, 2013: member

Changes:

Maintain a salt to perturbate the address index (protection against collisions).
Add support for address index entries in the block index, and maintain those if -addrindex is specified. It indexes the use of every >8-byte data push in output script or consumed script - or in case of no such push, the entire script.
Add a searchrawtransactions RPC call, which can look up raw transactions by address.

I both hate and love this patch. I hate making it easy to build infrastructure that relies on a fully-indexed blockchain (which shouldn’t be necessary), as it may remove the incentive to build more scalable solutions. On the other hand, in cases where the alternative is relying on a trusted third party, this approach is certainly preferable, and would allow an RPC-based blockexplorer, for example.

A full -txindex=1 -addrindex=1 index is about 2.7 GiB now.

mikehearn commented at 2:24 pm on June 28, 2013: contributor

I’m sure you’ll hate this idea even more, but why not expose searchrawtransactions via P2P and advertise it in a service bit? “fast import of a private key” is a pretty common feature request for SPV wallets. I don’t think it’s particularly useful myself, but apparently other people do and they go and use blockchain.info currently.

sipa commented at 2:37 pm on June 28, 2013: member

We’ve had this discussion. #2168

petertodd commented at 7:56 pm on June 28, 2013: contributor

There needs to be some way to find out how many entries are matched by a given address to allow blockexplorer apps’s to skip to the most recent transactions and display them first if requested. Failing that, at least have Python-style index ranges to allow you to start from the last entry. Yes, address reuse sucks etc. etc. but it’s a feature blockchain.info has and people will want it.

I like the search by PUSHDATA; nice to be able to search for 13MH4zmU4UT4Ct6BhoRFGjigC8gN9a9FNn and see all my multisig addresses. It may be worthwhile to also add scriptSigs to the database as well so that spent multisig P2SH’s can be identified.

EDIT: interesting, looks like we have a ExtractDestination() or similar failure with bare OP_CHECKMULTISIG…

sipa commented at 7:59 pm on June 28, 2013: member

@petertodd Thanks for catching those two comments.

There is a count and a skip argument, so you can paginate results (the database is queried each time, but the relevant transactions are only read from disk when actually requested).

petertodd commented at 8:01 pm on June 28, 2013: contributor

I know you can paginate; I’m saying that you need to be able to paginate in reverse direction. (my understanding is the results are in order of confirmations right?)

sipa commented at 8:07 pm on June 28, 2013: member

@petertodd They are in increasing position on disk, which for now corresponds to number of confirmations (except for transactions in sidechains - yes, those are returned too). I could just reverse the order, I guess.

One problem for the future is that when we’ll have headers-first sync and parallel block downloading, block order on disk won’t be necessarily consistent anymore with chain order. A solution is of course fetching all blocks, and sorting them by confirmations before pagination, but that’s a very significant overhead. Another solution is storing the height of each entry in the index…

petertodd commented at 8:10 pm on June 28, 2013: contributor

re: sidechains, interesting! That should be documented…

Gah, that does add a decent amount of complexity, although users are going to see it as important that they can see their latest satoshidice crap. :( I don’t have a great solution here, although I would lean towards storing height if that’s what it takes.

sipa commented at 8:13 pm on June 28, 2013: member

I don’t mind the complexity of adding that (though it probably adds several 100 MiB right now already), but I’m really against using an address index as a way to track new payments (which is what this basically boils down to, right?) - you should just have a wallet that watches the chain for that…

petertodd commented at 8:15 pm on June 28, 2013: contributor

Well, maybe this index is actually slightly premature, and what we need more immediately is a way to search just the UTXO set?

petertodd commented at 8:17 pm on June 28, 2013: contributor

Or, think in terms of searchrawtransactions should have UTXO-only and full-chain modes?

sipa commented at 8:17 pm on June 28, 2013: member

@petertodd What I want it for is an easy local blockexplorer. And the nice thing about the block tree (and all its indexes) is that they’re append only, which is easy implementation-wise.

I agree an address index to the UTXO is useful too, and probably less controversial. It’s not implemented however :)

sipa commented at 8:22 pm on June 28, 2013: member

@petertodd Also, searching the UTXO set by address would have a very different interface anyway, as it’s a set of transaction outputs, not a set of transactions. It wouldn’t make sense in the same RPC command, IMHO.

petertodd commented at 8:41 pm on June 28, 2013: contributor

@sipa Good point.

Well, seems to me that for a blockexplorer simply being able to iterate forward and reverse should be enough for the UI, and at worse you can add the height index later when the order guarantee breaks. In that case:

0searchrawtransactions <address> [skip=0] [count=100] [verbose=1]
1
2Return count transactions with <address> present in their scriptSig,
3skipping skip at the beginning. The ordering is oldest transaction first;
4if skip is negative the order returned is newest transaction first and skip+1
5transactions are skipped. If verbose=0 only txids are returned rather than
6the full transactions.

luke-jr commented at 9:28 pm on June 28, 2013: member

How about using a 64-bit hash of the full script instead?

sipa commented at 9:31 pm on June 28, 2013: member

Use the source, Luke!

sipa commented at 9:22 am on June 29, 2013: member

@luke-jr More seriously, every index entry adds around 10 bytes, and given the current amount of address reuse, those constitute the majority of the database. I’ve considered adding the entire script too, but that would mean a 50% increase (or more) because of that reason. Plus, if you know the full script, just pick the largest data push in it, and search for that.

sipa commented at 11:26 am on June 29, 2013: member

@petertodd There is an alternative implementation possible, where we store (height, txoffset) for each index entry instead of (filenum, blockoffset, txoffset) - That’s smaller too, and allows consistent ordering. The downside is that it can’t support side-chain matches, as heights are not unique.

mikehearn commented at 4:07 pm on June 29, 2013: contributor

Yes, I know we had the discussion before. Once again, I’ll remind us all that we don’t have any real power in this. If people can’t get the features they need in a decentralised way they’ll create a centralised way instead, hence, blockchain.info API. The idea that people will say “oh there’s no P2P command to read a blockchain index, guess I won’t write that app after all” doesn’t seem right to me.

luke-jr commented at 4:38 pm on June 29, 2013: member

@sipa I meant the 64-bit hash instead of the current 64-bit key. Any address can be converted to a script and hashed just fine.

johndillon commented at 4:39 pm on June 29, 2013: none

@mikehearn If you can’t offer a feature in a decentralized way with reasonably low resource consumption you should be happy that centralized services pop up and offer it instead.

If you want to be useful design an API that allows you to pay via micro-transactions those resources you are using by querying a node that has gone to the expense of using @sipa’s code to maintain a blockchain index. It would be easy to add this as a service bit and use preferential peering to make it possible for nodes to find peers supporting that API. Such a system would still be decentralized and resource consumption would be paid for in a fair and equitable way.

You do after have a brand new micro-transactions system…

sipa commented at 4:43 pm on June 29, 2013: member

@luke-jr That doesn’t allow you to query pay-to-pubkeys given the corresponding address.

sipa commented at 4:47 pm on June 29, 2013: member

@mikehearn I’ll agree to have an address index exposed to the P2P network if it can be done in an authenticated way, like Alan Reiner’s proposal for exposing an address-indexed committed merkle tree, and even then only for the UTXO set, and not the entire history.

I’m completely opposed to providing any service on the P2P network that requires the entire history being available, except bootstrapping a new full node.

johndillon commented at 4:48 pm on June 29, 2013: none

@sipa My preference would be to support the full (filenum, blockoffset, txoffset) set myself. Whatever the additional space used is it doesn’t seem like a big deal once you’ve decided to go to the effort of creating the index in the first place. Hard-drive space is cheap.

Incidentally, it does speak to how it would be useful to be able to iterate over every block stored, including orphans, and be able to deliberately add orphans to your database even after the fact. Heh, timestamp your orphan blocks and it would even be reasonable to maintain a set of every orphan ever created after the fact while allowing users to submit new blocks to this database even far into the future. (the timestamp is the anti-spam measure)

luke-jr commented at 4:50 pm on June 29, 2013: member

@sipa Neither does the specialized form you’re suggesting…?

johndillon commented at 4:52 pm on June 29, 2013: none

@luke-jr It does because every PUSHDATA > 20 bytes is indexed by first computing Hash160(data).

Even individual multisigs in a bare OP_CHECKMULTISIG can be searched for as @petertodd pointed out.

petertodd commented at 5:01 pm on June 29, 2013: contributor

@sipa I’ll second @johndillon’s thoughts on the full (filenum, blockoffset, txoffset) index.

Clever idea re: an orphan database… It’d be useful to have an index of all children for a given block too, but that can be a different pull-req; by that point we’ll have a -all-blockchain-indexes flag…

luke-jr commented at 5:06 pm on June 29, 2013: member

Aha, missed that part.

sipa commented at 5:12 pm on June 29, 2013: member

@petertodd To have a consistent ordering, we’d need (filenum, blockoffset, txoffset, height) even.

gastonmorixe commented at 3:52 am on July 5, 2013: none

This is awesome :+1: please, merge it

sipa commented at 11:05 pm on July 13, 2013: member

Rebased, and added a stable ordering (height-based) and negative offsets, to simplify backward pagination.

in src/rpcrawtransaction.cpp: in 5df5288225 outdated

173+    while (it != setpos.end() && nCount--) {
174+        CTransaction tx;
175+        uint256 hashBlock;
176+        if (!ReadTransaction(tx, *it, hashBlock))
177+            throw JSONRPCError(RPC_DESERIALIZATION_ERROR, "Cannot read transaction from disk");
178+        CDataStream ssTx(SER_NETWORK, PROTOCOL_VERSION);

petertodd commented at 9:33 am on July 15, 2013:

You’re missing a » here; setting fVerbose to false displays nothing.

Also, change fVerbose to be true/false? Seems to be what we’re doing elsewhere.

sipa commented at 10:24 am on July 15, 2013:

Indeed, a « is missing (it’s serializing, not deserializing).

For fVerbose, I’m keeping consistency with getrawtransaction, which also uses 0/1. The default here is decoding though, as a human user will likely be more interested in a decoded version.

in src/rpcrawtransaction.cpp: in 5df5288225 outdated

130@@ -131,6 +131,65 @@ void TxToJSON(const CTransaction& tx, const uint256 hashBlock, Object& entry)
131     }
132 }
133 
134+Value searchrawtransactions(const Array &params, bool fHelp)
135+{
136+    if (fHelp || params.size() < 1 || params.size() > 2)

petertodd commented at 9:35 am on July 15, 2013:

Should be params.size() > 4

sipa commented at 10:24 am on July 15, 2013:

Thanks, indeed.

gastonmorixe commented at 10:19 am on July 15, 2013: none

I agree, verbose is not working.

BitcoinPullTester commented at 6:06 am on July 20, 2013: none

Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/5df5288225dce4bd4c6c72254736307c02a6cd78 for binaries and test log. This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/ Contact BlueMatt on freenode if something looks broken.

Encapsulate CLevelDB iterators cleanly fd867c7dbc

Add address-based index

1) Maintain a salt to perturbate the address index (protection against
   collisions).
2) Add support for address index entries in the block index, and
   maintain those if -addrindex is specified. It indexes the use of
   every >8-byte data push in output script or consumed script - or in
   case of no such push, the entire script.
3) Add a searchrawtransactions RPC call, which can look up raw
   transactions by address.

4790f3c823

sipa commented at 1:06 pm on July 20, 2013: member

Rebased and fixed the bugs reported by @petertodd.

BitcoinPullTester commented at 0:01 am on July 21, 2013: none

Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/4790f3c823a33fae44b82ef7962372e38b1b0131 for binaries and test log. This test script verifies pulls every time they are updated. It, however, dies sometimes and fails to test properly. If you are waiting on a test, please check timestamps to verify that the test.log is moving at http://jenkins.bluematt.me/pull-tester/current/ Contact BlueMatt on freenode if something looks broken.

sipa commented at 7:43 pm on July 26, 2013: member

@petertodd Feel free to test the pageability now.

petertodd commented at 8:59 pm on July 26, 2013: contributor

@sipa Will do after my litecoin audit’s done.

petertodd commented at 5:27 am on July 31, 2013: contributor

searchrawtransactions 1111111111111111111114oLvT2 doesn’t work - looks like the issue is the first few lines in FindTransactionsByDestination() because the keyid is 0, which is correct in that case.

petertodd commented at 5:39 am on July 31, 2013: contributor

I’m having good luck with the code otherwise - matches results on blockchain.info, and non-standard oddities like searching for OP_RETURN data work fine. Also results returned appear to all be in correct order.

petertodd commented at 6:17 am on July 31, 2013: contributor

Getting some weird results with mwy5FX7MVgDutKYbXBxQG5q7EL6pmhHT58=Hash160(’’) on testnet, and the equivalent on mainnet. I’m seeing testnet txids a1f6a4ffcfd7bb4775790932aff1f82ac6a9b3b3e76c8faf8b11328e948afcca and 75f7d5e99912875e88d667afb48021b0b74916539c518618a8db4966661509df returned in the results - the former has a standard scriptPubKey, and a nonstandard scriptSig of “1”, or hex “51”. (IE push the number 1 to the stack) The latter has a totally empty scriptPubkey. Opcode 51 isn’t within the range of opcodes 0 to PUSHDATA4, so BuildAddrIndex should have gone to the !fHaveData branch, but that returns Hash160(script), which is definitely not empty.

A search for a script of just opcode 52, Hash160(’\x52’) doesn’t show this problem, (returns empty) and searching for script’s without pushdata’s like Hash160(’’\x61’) (op_nop) also works as expected. Oddly though Hash160(’\x51\x51’), two pushes of the constant 1 to the stack, doesn’t work and returns nothing even though there is a script exactly equal to that.

I’ll be honest, I’m a bit mystified, although I haven’t delved into a debugger yet.

gastonmorixe commented at 6:41 am on July 31, 2013: none

I just wanted to thank all of you for this implementation. Specially sipa who hates it and he did wrote it :)

Here you can see my work in progress from my Rails App, I do all the parsing and get the balances. Thank you all.

http://cl.ly/QNUL

Ton

gavinandresen commented at 6:37 am on August 13, 2013: contributor

ACK from me. I think we should pull; edge-case bugs with weird, non-standard transactions I don’t care about.

gmaxwell commented at 7:00 am on August 13, 2013: contributor

Only reason I see to delay the pull is that any fixes to address misindexing pointed out above may require a complete reindex.

petertodd commented at 9:05 am on August 13, 2013: contributor

I disagree given the reason why these tx’s aren’t working isn’t understood yet - could be a symptom of a subtle bug.

jgarzik commented at 2:59 am on August 25, 2013: contributor

ACK

ghost commented at 11:28 am on October 21, 2013: none

Would we be able to see this merged? There’s no outstanding issues in it (I’ve run it for literally months now), and there’d be considerable benefit in having this in the master.

sipa commented at 11:31 am on October 21, 2013: member

Seems there is a not-understood problem at least (as reported by @petertodd), and I don’t plan to work on this any time soon.

sipa commented at 11:37 am on October 21, 2013: member

In any case, I want to have watch-only wallet support before this, as it is a much more scalable solution for many problems that you would use an address-based index for. I’m closing this for now.

sipa closed this on Oct 21, 2013

ghost commented at 12:24 pm on October 21, 2013: none

This did however allow for some quite handy local block explorers. Between Abe and BlockExplorer’s respective database messes, you’re looking 100+ GB of external databases and literally weeks of CPU time. Those not being an option, I’m back forced to query Blockchain.info and Blockexplorer for my data; which is slow and somewhat demonstrates the links between the various addresses to a third party. It also means that I’m reliant on two extremely unstable and latent services for something which -addrindex would have allowed locally.

I realise that I can still just continue on using -addrindex as my own patch, but it would be of benefit to others to have something like it available in the main builds for developers who aren’t aware it exists.

sipa commented at 12:28 pm on October 21, 2013: member

Yes, I understand - that was mostly the reason for writing it. If the choice is between people using centralized indexing services and being able to run their own, I certainly would encourage the latter. But on the other hand, if the choice is between people building infrastructure that relies on such indexes being available for wallet services (preventing pruning later on), I certainly would encourage watch-only wallets.

In any case, I’m just closing it to clean up the pull request list. I’ll probably get back to it at some point (or someone else may).

laanwj commented at 12:36 pm on October 21, 2013: member

@63 if you care a lot about this, why not pick it up yourself and try to resolve the remaining issue and submit a pull?

jmcorgan commented at 7:41 pm on February 11, 2014: contributor

I’ve refreshed this branch to work with current master and have issue a new pull request #3652.

dexX7 referenced this in commit 402f643811 on Jun 16, 2014

dexX7 referenced this in commit 5095a0cf49 on Jun 16, 2014

Bushstar referenced this in commit 5057ad511c on Apr 5, 2019

DrahtBot locked this on Sep 8, 2021

Add an address-based transaction index #2802