[pull request idea] addressindex, spentindex, timestampindex (Bitcore patches)

karelbilek commented at 12:19 pm on May 9, 2017: contributor

We (SatoshiLabs) use bitcore for our backend; and bitcore depends on a fork of bitcoin-core, that adds additional indexes and some other functionality.

Since bitpay stopped updating the fork to newest bitcoin, effectively making bitcore unable to see segwit, we keep rebasing the code to the latest releases. So I thought it wouldn’t be a bad idea to try to actually put it into upstream :)

What the code does is adding additional indexes into bitcoin-core that enable to run full block explorer, like insight, and also adds some RPC calls for those indexes. I think it would be beneficial to the ecosystem at large to have these RPC calls/indexes available in upstream.

Most of the work has been done by @braydonf ; I just rebased the code (first to 0.14.* here - https://github.com/satoshilabs/bitcoin - and on master in this PR) and tested it with segwit. (The original, outdated bitcore code is at https://github.com/bitpay/bitcoin )

This is not meant as a definitive PR - it is too big and needs to be cleaned up and broken into smaller parts (it’s obvious it has been rebased several times) - but more as a request for discussion; I want to know if I should start spending time with cleaning the commits up etc. or if you don’t want this functionality in the bitcoin-core at all.

The added RPCs are:

getblockdeltas
getblockhashes
getaddressmempool
getaddressutxos
getaddressdeltas
getaddressbalance
getaddresstxids
getspentinfo

These calls are sufficient to build block explorer.

indexes: adds additional address, spent and timestamp indexes

Adds new bitcoin.conf configuration options for three new indexes:
-addressindex=1
-spentindex=1
-timestampindex=1

The addressindex records all changes to an address for retrieving txids, balances and
unspent outputs for addresses. Changes are stored and sorted in block order. Both p2sh
and p2pkh address types are supported. The index records two sets of key/value pairs.
The first records all activity and is useful for viewing transaction history and
all changes. The second is specifically for retrieving unspent outputs by address, and
is smaller as values are removed once they are spent.

The spentindex has multiple purposes and brings closer together inputs and outputs of
transactions. The main purpose is to efficiently determine the address and amount of
an input's previous output. The second purpose is to be able to determine which
input spent an output.

The timestampindex keeps track of timestamps with block hashes and is useful for searching
blocks by date instead of by height. This is useful for a block explorer that will give
search options by date. The index uses logical time correction to make sure that the
results are sorted in block order. The logical time of a block is actual timestamp of the
block, unless it is less than (earlier) the previous block's logical time, and in that
case it is one second greater than the previous block's logical time.

Includes logical time fix by Chethan Krishna

Conflicts:
	src/main.cpp
	src/main.h
	src/txdb.cpp
	src/txdb.h
	src/txmempool.h

2045ffebe1

rpc: adds rpc commands for address, spent and timestamp indexes

Adds new rpc commands that use the address, spent and timestamp indexes,
including the new commands:

- getblockdeltas
- getblockhashes
- getaddressmempool
- getaddressutxos
- getaddressdeltas
- getaddressbalance
- getaddresstxids
- getspentinfo

And modifications to the command:
- getrawtransaction

Conflicts:
	src/rpc/blockchain.cpp
	src/rpc/client.cpp
	src/rpc/misc.cpp
	src/rpc/rawtransaction.cpp
	src/rpc/server.cpp
	src/rpcserver.h

65cbc4c319

tests: adds rpc tests for address, spent and timestamp indexes

Tests the functionality of the indexes as well as the rpc commands

74241141ba

tests: adds unit test for IsPayToPublicKeyHash method 9ae322d06d

indexes: refactoring and fixes applying changes to 0.13 1b99ecb091

indexes: additional logging and checks for indexes a74235ac4b

tests: update rpc index tests for 0.13 95d72be5a5

db: add options to configure block index database

There was a previous assumption that blockindex would be quite small. With addressindex
and spentindex enabled the blockindex is much larger and the amount of cache allocated for
it should also increase. Furthermore, enabling compression should decrease the amount of
disk space required and less data to write/read. The default leveldb max_open_files is set to
1000, for the blockindex the default is set to 1000 with compression. The 64 value that is
current is kept for the utxo database and does not enable compression. Two additional options
are added here to be able to configure the values for leveldb and the block index:

- `-dbmaxopenfiles` A number of files for leveldb to keep open
- `-dbcompression` Boolean 0 or 1 to enable snappy leveldb compression

Conflicts:
	src/dbwrapper.cpp
	src/init.cpp

c7f2d82518

tests: test dbwrapper options compression and maxopenfiles bc474a169e

tests: include and fix txindex test with rpc tests 06726dd413

build: include timestampindex.h in makefile 74a85014d2

Bitcore 0.14.0 fixes 7a4ad0c710

karelbilek commented at 12:28 pm on May 9, 2017: contributor

The tests seem to be failing on some rebased code.

But as I said, I added this PR as a general discussion request. If there is an interest, I will clean this up.

jonasschnelli commented at 1:35 pm on May 9, 2017: contributor

Having an independent address-index available somewhere may be handy and in general a great thing. Though, I’m not sure if we want to add this to the Bitcoin-Core repository. It already contains (too) much. Wallet, utilities, txindex, UI, p2p, consensus.

I would prefer to keep the address-index (and all other sorts of indexes) in other repositories. Since Core offers ZMQ, creating an address-index via a python script/application (with it’s independent leveldb dababase/layer) should not be very complex. The python application could then handle the additional RPC commands.

If there are missing ZMQ features (I guess there is still an open PR for mempool ZMQ notifications), we should rather add those instead of the indexes.

But I’m open for other opinions.

karelbilek commented at 1:58 pm on May 9, 2017: contributor

I am not sure about the exact reasoning (cc @braydonf again), but older versions of Bitcore (< 3) had something similar - unpatched bitcoin core with additional indexes in node.js (see https://github.com/bitpay/bitcore-node/tree/v2.1.1/lib/services ) - but in Bitcore 3, they instead moved to including the address index in the bitcoin core itself.

Perhaps it is easier to maintain one database over 2 separate databases? (I am guessing here)

dcousens commented at 2:04 pm on May 9, 2017: contributor

@runn1ng have you looked at https://github.com/bitcoinjs/indexd? It has all the indexes listed above, and it works seamlessly with an unpatched bitcoin-core RPC.

karelbilek commented at 2:10 pm on May 9, 2017: contributor

@dcousens thanks for noticing me! I will definitely look at that.

braydonf commented at 2:13 pm on May 9, 2017: none

There was a significant performance gain by building the indexes at the same time as the txindex, since each block only needed to be read and deserialized once.

TheBlueMatt commented at 2:19 pm on May 9, 2017: member

See-also #9806 (and #8660 and #5048). I think the jury might still be out on whether we want more indexes supported in Bitcoin Core.

dcousens commented at 2:20 pm on May 9, 2017: contributor

@braydonf at ~400ms per 1MB block… that really shouldn’t be an issue - and the above is Javascript. If you’re concern is related to the initial import, tools like https://github.com/dcousens/fast-dat-parser can dump the leveldb as fast as it can push IO.

IMHO I don’t think bitcoind should support any further indexes…

braydonf commented at 4:55 am on May 10, 2017: none

In the past, we had implemented indexes in several ways, as mentioned, and found this to be the best option. I do think additional indexes, if needed, should be implemented within a full/spv node, the closeness reduces the overall complexity and duplication of disk and CPU usage. It otherwise ends up being what is effectively two nodes next to each other. The point that I realized this was when it was necessary to keep what is effectively another UTXO set for determining the address for inputs.

As far as the utility of the indexes, and if they should be included in Bitcoin Core: I’d like to see the ability for an address index to be pruned (not included in this current implementation). An address index of the UTXO set and memory pool only, for example, may enable a pruned Bitcoin QT wallet to not need to rescan (and download) all blocks when adding new keys to a wallet. This could support many different hardware/offline wallets, with addresses and keys that the full node may not be aware. And in a way that doesn’t depend on remote servers. This could also be useful for accepting bitcoin payments in an environment that the keys may change frequently e.g. on a web server, while also reducing the setup costs.

As far as specific implementation details here, there may be some work here still: The indexes may ideally be implemented with a B/B+ tree in C (e.g. LMDB, WiredTiger, and etc.), the LSM tree of LevelDB is useful for smaller write heavy sets of data that are not queried often. The write amplification that happens in block compaction, especially when the database reaches 100+ GB can become problematic. This implementation of indexes is impacted by this issue, however it’s mostly revolved by running on SSD and could be helped by running manual compaction (also not implemented here). That same issue may eventually affect the UTXO database as it grows in size.

dcousens commented at 5:25 am on May 10, 2017: contributor

the closeness reduces the overall complexity and duplication of disk and CPU usage

If you are using a pruned or SPV node, you really shouldn’t be suffering from disk duplication problems. The CPU usage for an index is near completely negligible compared to your IO. You don’t need to verify anything if you trust your data source (your node).

An address index … [would be optimal] when adding new keys to a wallet

Personally, I would only concept ACK this if it was completely contained to src/wallet.

As far as I’m concerned, bitcoind is my source of truth. Any “indexes” are purely convenient views of that truth - and therefore as a matter of single responsibility - they should be isolated.

I don’t see how there could be any significant IO inequal to that as if it was inlined within bitcoind - short of your local pipe and RPC overhead - which, as far as I have measured - is completely negligible compared to the disk IO. If any effort was to be made in the name of “performance”, it should be focused on a high performance binary pipe, not just moving cruft into bitcoind for the sake of convenience.

…reducing the setup costs…

I agree that the lack of convenient setup/configuration for these services is not great, but that doesn’t mean we have to merge it all into a single binary.

The write amplification that happens in block compaction, especially when the database reaches 100+ GB can become problematic

I don’t see how an inline index would alleviate this.

junderw commented at 10:19 am on May 10, 2017: contributor

Allowing Bitcore-Wallet-Service to run on top of a vanilla bitcoind + bitcore (no patched binaries needed during install) would be extremely useful.

priestc commented at 12:41 pm on May 16, 2017: none

I just want to say that a UTXO database index in bitcoin core is a very good idea in my opinion.

Keep in mind also there is BIP131, which requires a index on the UTXO set to be implemented. If BIP131 gets implemented, the UTXO database no longer becomes optional (as it will be required to validate new transactions), but since the UTXO index is relatively tiny compared to the txindex I don’t think its a problem.

In the future, I see the wallet features and UI elements removed from bitcoin core, and backend stuff like indexes taking it’s place.My opinion is we should go hog wild and add as many (optional) indexes as conceivable.

jtimon commented at 11:19 pm on July 18, 2017: contributor

As said in #9806 I’m not opposed to an additional optional index similar to -txindex. For example “-scriptpubkeyindex”. But instead I think it’s more generic to use the scriptPubKey directly as key instead of addresses. The translation from address to script is simple for p2pkh (and other script templates) can be done externally or maybe internally too as an additional rpc call.

The minimum necessary for an explorer I think would be -scriptpubkeyindex plus a call searchscriptpukey that takes the hex of a scriptpubkey and returns a list of outpoints (perhaps including the block hash, see #10275 ). From there, one can call getrawtransaction and/or gettxout can be used. But this is just the minimal design, not the optimal or more convenient. At the same time, I think the list currently in the OP is too much.

But for the block explorer use case I assume no pruning and serving spent transactions too, which will result in a big index and code that I’m sure some people won’t be very happy to help maintain. Even if that’s the case, I’m interested in this use case and I may be interested in rebasing the same resulting branch from major version to major version. I want to note that “I run my own explorer locally for privacy reasons” is a perfectly valid use case, not only for this index, but also for non pruned full nodes in general (and we shouldn’t disregard potential reasons for users to want to run archival nodes, beyond “helping the network”).

Starting with a branch that only indexes the unspent scriptpupkeys as @braydonf suggests may be much more acceptable (and whether or not the explorer part is acceptable, the diff to maintain will be much smaller if that gets accepted). Not only because the index would be much smaller and because it doesn’t conflict with pruning in any way, but also because the use case of removing the need for rescan is much more compelling.

If you are using a pruned or SPV node, you really shouldn’t be suffering from disk duplication problems.

Sure, the explorer and norescan are different use cases. Even non pruned full nodes could benefit from the latter, for example, allowing to import private keys leveraging the index, maybe with an additional norescan option that allows you to only look in the scriptpubkeyUtxoIndex and completely avoid a rescan to get info about already spent outputs that key once controlled.

You don’t need to verify anything if you trust your data source (your node).

Pruned or not, you don’t need to trust your data source if you are already a full node. I think the SPV use case is just adding confusion, the utxo-only index can be very useful without taking SPV nodes into account at all.

I agree that the lack of convenient setup/configuration for these services is not great, but that doesn’t mean we have to merge it all into a single binary.

Absolutely, adding the index doesn’t mean we should also add every potential call that would make use of it. Even if that would be more convenient, marginal use cases can perhaps go through less direct routes like implementing some things on their side or calling the rpc twice instead of once to get some data they want, which is may not be a great penalty in performance depending on the case.

karelbilek commented at 10:26 pm on August 17, 2017: contributor

I am closing this PR. We are running bitcoin with these custom patches on our servers and while it is working so far, it is not very maintainable long-term and the leveldb performance is subpar for the addressindex.

External index is the only solution.

karelbilek closed this on Aug 17, 2017

DrahtBot locked this on Sep 8, 2021

[pull request idea] addressindex, spentindex, timestampindex (Bitcore patches) #10370