rpc: Add getblocklocations call #20702

romanz commented at 12:33 PM on December 18, 2020: contributor

This RPC allows the client to retrieve the file system locations of the confirmed blocks and their undo data, to allow building efficient indexes outside of Bitcoin Core.

An example usage is described here: https://github.com/romanz/electrs/issues/308

By using the new RPC, it is possible to build an address-based index taking ~24GB and a txindex taking ~6GB (as of Dec. 2020).

romanz renamed this:
~~rpc: Add getblocklocations~~
rpc: Add getblocklocations call
on Dec 18, 2020

laanwj commented at 1:26 PM on December 18, 2020: member

I expect this to be a difficult sell because we generally don't treat any of the data files as a stable external interface. Pointers to the file system can be considered implementation details subject to change.

But on the other hand, the format of the block files specifically hasn't ever significantly changed, and is (I think) unlikely to change.

To be clear I definitely see when this can be useful. For example the size of our own contrib/linearize tool could be cut in half by using this.

romanz force-pushed on Dec 18, 2020

DrahtBot added the label RPC/REST/ZMQ on Dec 18, 2020

DrahtBot commented at 4:58 PM on December 18, 2020: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#20012 (rpc: Remove duplicate name and argNames from CRPCCommand by MarcoFalke)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Kixunil commented at 7:08 PM on December 18, 2020: none

Perhaps there are ways to mitigate this issue? I could see two situations in which block data move around:

arbitrarily during run time - this seems incredibly unlikely outside of reorgs (reindexing needs to be done in case of reorgs anyway)
on upgrade - upgrade may take longer time because of this; similar things already happened to indexes

A simple way to deal with these is to have some kind of "block layout version" that can be obtained over RPC. If the layout is ever changed, the version is bumped and indexers will need to reindex. As described above, a layout change is most likely to come with an upgrade, where lengthy work needs to be done anyway, so it wouldn't be too annoying.

To make it a bit more future-proof we could have two version numbers: one for the format of blocks already on disk, another for the format in general. If bitcoind is updated so that the existing blocks are not changed but future blocks may have some differences, then the former version number doesn't change and the latter does. This avoids a reindex when it is not actually needed, and also avoids indexers that don't support a new version blindly attempting to process a layout they don't understand.

I believe it'd be best to introduce this versioning right from the start. (Just return constant zeroes for now.)

If it ever happens that a single block (or a few blocks) is moved around, then doing full reindexing would be inefficient but I assume it'd be sensible to not try to solve it right now. That means let's not do something excessive like adding a layout version to each block now.
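A minimal sketch of how an external indexer might act on the two version numbers proposed above. Everything here is hypothetical: the PR exposes no such fields, and the names `LayoutVersions`, `on_disk`, `current`, `max_supported` and `decide` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutVersions:
    """Hypothetical pair of version numbers reported by the node."""
    on_disk: int   # format of the blocks already written to disk
    current: int   # format used for blocks written from now on


def decide(stored: LayoutVersions, reported: LayoutVersions, max_supported: int) -> str:
    """Return what the indexer should do after (re)connecting to the node."""
    if reported.on_disk > max_supported or reported.current > max_supported:
        # Never try to parse a layout this indexer does not understand.
        return "refuse"
    if reported.on_disk != stored.on_disk:
        # Existing blocks were rewritten, so previously saved offsets are stale.
        return "reindex"
    # At most the format of future blocks differs: saved offsets stay valid.
    return "continue"
```

With constant zeroes on both sides, as suggested for the initial version, `decide` always returns "continue".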

in src/rpc/blockchain.cpp:2477 in 6e61171710 outdated

2469 | @@ -2470,6 +2470,60 @@ static RPCHelpMan dumptxoutset()
2470 |      };
2471 |  }
2472 |  
2473 | +static RPCHelpMan getblocklocations()
2474 | +{
2475 | +    return RPCHelpMan{"getblocklocations",
2476 | +                "\nEXPERIMENTAL warning: this call may be removed or changed in future releases.\n"
2477 | +                "\nReturns a JSON for the location of 'blockhash' and its previous 'nblocks' block and undo data.\n",

Kixunil commented at 7:09 PM on December 18, 2020:

@romanz do you have any plan for what to do with electrs if this is removed in the future?

romanz commented at 10:11 AM on December 19, 2020:

If Electrum protocol will allow using the actual scriptPubKey for querying the historical transactions, it would be possible to use blockfilterindex-based RPC, such as #20664 (as suggested here). Maybe it can be supported with bwt if the full node has blockfilters? CC: @SomberNight @shesek

in test/functional/rpc_getblocklocations.py:35 in 6e61171710 outdated

  26 | +            {'file': 0, 'data': 1081, 'undo': 131, 'prev': '602403f82060e224423a9f22062fad3f5333beeee3c30313f6ea808e37b7e0b2'},
  27 | +            {'file': 0, 'data': 821, 'undo': 90, 'prev': '334ab00aba5d213c8f67161ddff346c4643d0709c3bdafa4b4b05fe6f7e4ed48'},
  28 | +            {'file': 0, 'data': 561, 'undo': 49, 'prev': '43f10598f19eced9c514f5ae40dbce0ab101362a22e18820901c6e03d7babe0b'},
  29 | +            {'file': 0, 'data': 301, 'undo': 8, 'prev': '0f9188f13cb7b2c71f2a335e3a4fc328bf5beb436012afca590b1a11466e2206'},
  30 | +            {'file': 0, 'data': 8, 'prev': '0000000000000000000000000000000000000000000000000000000000000000'},  # genesis block
  31 | +        ]

Kixunil commented at 7:21 PM on December 18, 2020:

Perhaps it'd make more sense to test that the blocks stored on the returned offsets contain the same data as when using getblock call?

romanz commented at 10:12 AM on December 19, 2020:

Sounds good, will do.

romanz commented at 10:50 AM on December 19, 2020:

Done at https://github.com/bitcoin/bitcoin/pull/20702/commits/9d5c2bb85c2601ad54fa9df49b4f13e0490fb421.
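For illustration, reading a block back from a returned location could look roughly like the sketch below. It assumes the blk*.dat record layout of 4-byte network magic, 4-byte little-endian length, then the serialized block, with 'data' pointing at the first block byte. That matches the genesis offset of 8 in the test above, but it is exactly the kind of implementation detail this thread says is not a stable interface. The helper name `read_block_at` is made up.

```python
import struct
from pathlib import Path


def read_block_at(blocks_dir: Path, file_index: int, data_pos: int) -> bytes:
    """Read one serialized block given a 'file' index and a 'data' offset.

    Assumption: each record is laid out as
    4-byte magic | 4-byte little-endian length | block bytes,
    and data_pos is the offset of the first block byte.
    """
    path = blocks_dir / f"blk{file_index:05d}.dat"
    with path.open("rb") as f:
        f.seek(data_pos - 4)                      # step back onto the length field
        (size,) = struct.unpack("<I", f.read(4))
        block = f.read(size)
    if len(block) != size:
        raise ValueError("truncated block record")
    return block
```

A test along the lines suggested above would then compare `read_block_at(...)` with the bytes decoded from the hex string that `getblock "blockhash" 0` returns.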

in src/rpc/blockchain.cpp:2480 in 6e61171710 outdated

2475 | +    return RPCHelpMan{"getblocklocations",
2476 | +                "\nEXPERIMENTAL warning: this call may be removed or changed in future releases.\n"
2477 | +                "\nReturns a JSON for the location of 'blockhash' and its previous 'nblocks' block and undo data.\n",
2478 | +                {
2479 | +                    {"blockhash", RPCArg::Type::STR_HEX, RPCArg::Optional::NO, "The block hash"},
2480 | +                    {"nblocks", RPCArg::Type::NUM, RPCArg::Optional::NO, "Number of block locations to return"},

Kixunil commented at 7:22 PM on December 18, 2020:

I'd like to see this description clarified. Something like: "The number of blocks to process, including the block with given blockhash. The blocks following the given block will be returned if bigger than one."

I'm not entirely satisfied with my wording either, so will be happy for clearer version. The idea is to communicate what exactly is in the result.

romanz commented at 10:49 AM on December 19, 2020:

Updated the documentation:

getblocklocations "blockhash" nblocks

EXPERIMENTAL warning: this call may be removed or changed in future releases.

Returns a JSON for the file system location of 'blockhash' block and undo data.

It is possible to return also the locations of previous blocks, by specifying 'nblocks' > 1.

Arguments:
1. blockhash    (string, required) The block hash
2. nblocks      (numeric, required) Maximum number of locations to return (up to genesis block)

Result:
[           (json array)
  n,        (numeric) blk*.dat/rev*.dat file index
  n,        (numeric) block data file offset
  n,        (numeric) undo data file offset (if exists)
  "hex",    (string) previous block hash
  ...
]

Examples:
> bitcoin-cli getblocklocations "00000000c937983704a73af28acdec37b049d214adbda81d7e2a3dd146f6ed09" 10

WDYT?

Kixunil commented at 4:16 PM on December 19, 2020:

Looks good. :+1:
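To make the 'nblocks' semantics concrete, here is a small model of the walk described in the help text: start at 'blockhash', follow 'prev', and stop after 'nblocks' entries or once the genesis block has been emitted. The in-memory `index` dict and the function name are hypothetical; this is not the C++ implementation from the PR.

```python
def block_locations(index: dict, blockhash: str, nblocks: int) -> list:
    """Collect up to `nblocks` locations, walking back through 'prev'.

    `index` is a hypothetical map: block hash -> location dict with the keys
    used in the functional test ('file', 'data', optional 'undo', 'prev').
    """
    result = []
    current = blockhash
    while len(result) < nblocks and current in index:
        location = index[current]
        result.append(location)
        current = location["prev"]   # the all-zero hash is never in the index
    return result
```

Asking for more locations than there are ancestors simply returns a shorter list ending at the genesis block, which has no 'undo' entry.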

in src/rpc/blockchain.cpp:2508 in 6e61171710 outdated

2498 | +    uint256 hash(ParseHashV(request.params[0], "blockhash"));
2499 | +    size_t nblocks = request.params[1].get_int();
2500 | +
2501 | +    const CBlockIndex* pblockindex = WITH_LOCK(cs_main, return LookupBlockIndex(hash));
2502 | +    if (!pblockindex) {
2503 | +        throw JSONRPCError(RPC_INVALID_ADDRESS_OR_KEY, "Block not found");

Kixunil commented at 7:25 PM on December 18, 2020:

I suppose this is hit (also) when the block is pruned? Perhaps document it?

romanz commented at 10:48 AM on December 19, 2020:

Good point - added a check in https://github.com/bitcoin/bitcoin/pull/20702/commits/9d5c2bb85c2601ad54fa9df49b4f13e0490fb421#diff-decae4be02fb8a47ab4557fe74a9cb853bdfa3ec0fa1b515c0a1e5de91f4ad0bR2506.

craigraw commented at 7:12 AM on December 19, 2020: none

Supporting efficient indexes outside of Bitcoin Core is a critical privacy improvement. The Bitcoin Core wallet must keep the wallet's public keys unencrypted in order to track them, meaning that a compromise of the node is a complete compromise of the wallet privacy.

A full index of the kind that electrs provides allows such knowledge of the public keys to be restricted to the period that the wallet is being accessed, and then in memory only. This reduces the attack surface considerably and greatly improves potential privacy. This is not a slight on the Bitcoin Core wallet, but a consequence of the architecture, in which privacy competes with the need for nodes to ideally be always-on and always-connected.

Supporting efficient creation of these indexes means that more users will be able to host them, improving Bitcoin's privacy in general. Currently, with this RPC my electrs testnet full index is 2.6G in size which is reasonable in comparison to the 32G of the testnet3 blockchain, and the initial indexing time has been greatly reduced as well.

jonasschnelli commented at 9:29 AM on December 19, 2020: contributor

I'm all-in for adding APIs for more efficient external indexing. I share @laanwj's concern about the stability of file pointers.

However, I disagree with @craigraw:

Supporting efficient indexes outside of Bitcoin Core is a critical privacy improvement. The Bitcoin Core wallet must keep the wallet's public keys unencrypted in order to track them, meaning that a compromise of the node is a complete compromise of the wallet privacy.

I don't actually follow the exact use case here. But it looks like you want to reduce the wallet-privacy-exposure time (the time publickeys/scripts are exposed to the memory). Which I think is a valid point. Though indexing "all" addresses for this seems an inefficient trade-off.

One seeking limited exposure time for their public-keys/scripts could and maybe should use blockfilters. That seems to scale much better and the filters are of value for the entire network.

Just scan the filters (timespan from last-wallet-sync to now) and scan the relevant blocks. Should be fairly quick for a couple of days or weeks (seconds).

In general, I discourage building full indexes for personal wallet backends. It is not efficient IMO and leads to development focus on the wrong end.

A full address index can be of value for pure exploring, inspecting, developing, etc.

Concept ~0 on this.

Kixunil commented at 10:19 AM on December 19, 2020: none

@jonasschnelli block filters would be a great argument if some benchmarks were provided and wallets actually supported them. Currently, one of the most reasonable wallets - Electrum requires a specific protocol which is not block filters. The server could perhaps use block filters in the background, the question is is it still possible/reasonable? My understanding is that block filters still have O(block_count) complexity, which seems high for each request.

romanz force-pushed on Dec 19, 2020

laanwj commented at 11:02 AM on December 19, 2020: member

arbitrarily during run time - this seems incredibly unlikely outside of reorgs (reindexing needs to be done in case of reorgs anyway)

Blocks don't move around on disk ever. A re-org keeps the old blocks, it only changes connections in the block database. (It might make the same block height point to a different block, of course, but this is not a different issue from when the RPC is used to retrieve the block data) The only conflict that can really happen right now is that the block file is deleted by pruning. But this can be avoided by making the client manage pruning through pruneblockchain.

Kixunil commented at 4:20 PM on December 19, 2020: none

Yeah, I meant in the future. Still unlikely. :) Pruning already needs to be disabled for electrs to work at all (this is well-documented), so doing anything else would be usage error.

romanz force-pushed on Dec 20, 2020

romanz commented at 12:23 PM on December 20, 2020: contributor

I have changed the RPC to fail "early" in pruned mode. Also, improved the tests to cover various batch sizes and RPC failures.

RPC: Add getblocklocations call

9b03c654eb

romanz force-pushed on Dec 20, 2020

romanz commented at 2:20 PM on December 29, 2020: contributor

My understanding is that block filters still have O(block_count) complexity, which seems high for each request.

IIUC, the blockfilters today take ~5.8GB so it would probably take a while scanning them given a specific address. Using a "global address index" (like the one suggested here) allows efficient lookup for all transactions funding/spending a specific script pubkey. For example, I have looked up 375Bia3NWiBqR89184zxRTDk1hWRpfKKK1, which has a short history (one funding transaction and one spending transaction).

First lookup (with cold OS cache - after echo 3 | sudo tee /proc/sys/vm/drop_caches):

[2020-12-29T14:07:58.433Z DEBUG electrs_rpc] 0: recv {"jsonrpc": "2.0", "method": "blockchain.scripthash.subscribe", "id": 11, "params": ["217af682a23ba6d8ae1f59b24493d6ea4adec0e110899f1d04c95bacd86d1cde"]}
[2020-12-29T14:07:58.579Z DEBUG electrs_index::index] 217af682a23ba6d8ae1f59b24493d6ea4adec0e110899f1d04c95bacd86d1cde has 2 rows
[2020-12-29T14:07:58.783Z DEBUG electrs_rpc] 0: send {"id":11,"jsonrpc":"2.0","result":"3f56029cc7c23fc5d4e5f4d43632530e4f7daec287b34b13f84c4b7d93448ab5"}

Subsequent lookups (with warm OS cache - after restarting electrs):

[2020-12-29T14:10:27.009Z DEBUG electrs_rpc] 0: recv {"jsonrpc": "2.0", "method": "blockchain.scripthash.subscribe", "id": 11, "params": ["217af682a23ba6d8ae1f59b24493d6ea4adec0e110899f1d04c95bacd86d1cde"]}
[2020-12-29T14:10:27.018Z DEBUG electrs_index::index] 217af682a23ba6d8ae1f59b24493d6ea4adec0e110899f1d04c95bacd86d1cde has 2 rows
[2020-12-29T14:10:27.019Z DEBUG electrs_rpc] 0: send {"id":11,"jsonrpc":"2.0","result":"3f56029cc7c23fc5d4e5f4d43632530e4f7daec287b34b13f84c4b7d93448ab5"}

In both cases, the request first goes to the RocksDB-based index, taking ~150ms (cold cache) / ~9ms (warm cache). The subsequent reads go to the actual blk*.dat files to read the transactions from the filesystem, taking ~200ms (cold) / ~1ms (warm). Note: the index and blk*.dat files are stored on HDD-based storage.
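The rounded figures can be re-derived from the log timestamps above. The snippet below only does the subtraction between consecutive log lines; the helper name `step_ms` is made up.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M:%S.%fZ"


def step_ms(timestamps: list) -> list:
    """Whole milliseconds elapsed between consecutive log timestamps."""
    parsed = [datetime.strptime(t, FMT) for t in timestamps]
    return [(b - a) // timedelta(milliseconds=1) for a, b in zip(parsed, parsed[1:])]


# recv -> index rows -> send, copied from the two log excerpts above
cold = step_ms(["2020-12-29T14:07:58.433Z", "2020-12-29T14:07:58.579Z", "2020-12-29T14:07:58.783Z"])
warm = step_ms(["2020-12-29T14:10:27.009Z", "2020-12-29T14:10:27.018Z", "2020-12-29T14:10:27.019Z"])
print(cold, warm)  # [146, 204] [9, 1]
```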

romanz commented at 2:29 PM on January 2, 2021: contributor

BTW, it takes ~1m22s to find the relevant blocks for 375Bia3NWiBqR89184zxRTDk1hWRpfKKK1, using #20664 scanblockfilters using the same machine from #20702 (comment):

$ ~/Code/bitcoin-core/src/bitcoin-cli getblockchaininfo | jq .blocks
664151

$ time ~/Code/bitcoin-core/src/bitcoin-cli scanblockfilters '["addr(375Bia3NWiBqR89184zxRTDk1hWRpfKKK1)"]' 600000
[
  "00000000000000000007316856900e76b4f7a9139cfbfba89842c8d196cd5f91",
  "0000000000000000000855f761d212c4fa432977a4000c31996dab4713a57016"
]

real	1m22.170s
user	0m0.004s
sys	0m0.000s
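The difference between the two approaches can be sketched with a toy model. Each per-block filter is modelled here as a plain Python set of scripts, which is an assumption for illustration only: real BIP 158 filters are compact and probabilistic, and electrs stores its index in RocksDB rather than a dict.

```python
def scan_filters(filters: list, script: bytes, start_height: int = 0) -> list:
    """O(block_count): test every per-block filter from start_height on."""
    return [h for h in range(start_height, len(filters)) if script in filters[h]]


def build_index(filters: list) -> dict:
    """One-time pass that maps each script to the heights that touch it."""
    index = {}
    for height, scripts in enumerate(filters):
        for s in scripts:
            index.setdefault(s, []).append(height)
    return index


def lookup(index: dict, script: bytes) -> list:
    """Cost depends on the script's history, not on the chain length."""
    return index.get(script, [])
```

Both paths return the same block heights; the scan pays the per-block cost on every request, while the index pays it once up front and then needs the extra storage.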

luke-jr commented at 12:39 AM on January 3, 2021: member

But on the other hand, the format of the block files specifically hasn't ever significantly changed, and is (I think) unlikely to change.

Undo files, on the other hand, I could see changing significantly, and are implementation-specific.

Pruning already needs to be disabled for electrs to work at all

It doesn't make sense to only add this for electrs. Please enable this for pruning...

As long as we don't reuse old block file numbers, it should be fine. (The application calling might need to handle the error, but it would need to handle a RPC exception too)

Kixunil commented at 9:17 PM on January 3, 2021: none

@luke-jr I meant electrs will not work with pruning but yes, https://github.com/bitcoin/bitcoin/pull/20702/files#diff-decae4be02fb8a47ab4557fe74a9cb853bdfa3ec0fa1b515c0a1e5de91f4ad0bR2499 could be more fine-grained.

luke-jr commented at 9:21 PM on January 3, 2021: member

I think it would be best to simply remove the check entirely. Expect clients to deal with the possibility that on a pruned node the returned file could be missing.

Kixunil commented at 9:31 PM on January 3, 2021: none

@luke-jr yeah, I was originally thinking of distinguishing it from other errors when sending response to the client for debugging purposes. Should make error messages and troubleshooting much easier.

luke-jr commented at 9:35 PM on January 3, 2021: member

It creates a race condition. Now there are two ways a pruned block might fail: an exception, or a missing file. Better to just have the missing file be consistently the only failure mode.

Also, an exception means you have to retry to get the locations of the other blocks. Instead, returning the deleted files means you can get all of the results and deal with the missing ones later without any further requests.
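On the client side, the "missing file is the only failure mode" approach might look like the sketch below. It assumes the node reports locations even for pruned blocks, so a pruned block shows up only as an absent blk*.dat file. The function name is hypothetical.

```python
from pathlib import Path


def split_by_availability(blocks_dir: Path, locations: list) -> tuple:
    """Partition locations into (readable, missing) without extra RPC calls."""
    readable, missing = [], []
    for loc in locations:
        path = blocks_dir / f"blk{loc['file']:05d}.dat"
        try:
            # Opening the file is the check itself, so there is no separate
            # "is it pruned?" query that could race with the pruner.
            with path.open("rb"):
                readable.append(loc)
        except FileNotFoundError:
            missing.append(loc)
    return readable, missing
```

The caller can index everything in `readable` right away and deal with `missing` later.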

Kixunil commented at 9:48 PM on January 3, 2021: none

Do I understand correctly that without explicit check, bitcoind just returns the location even if the file is actually deleted and the client then finds out that the file is deleted only by attempting to access it?

gmaxwell commented at 4:20 AM on January 5, 2021: contributor

But on the other hand, the format of the block files specifically hasn't ever significantly changed,

The format of the files has changed: when blocks could be added out of order, it broke armory, which expected to read them. Changing it to obfuscate the blocks, to prevent anti-virus false positives, was also discussed, but it turned out that AV systems simply ignore files over some threshold size and blocks are big enough.

and is (I think) unlikely to change.

Maybe it's unlikely to change, but it isn't the case that there aren't reasons to change it.

In the blockstream bitcoinsatellite repository there is an alternative serializer for transactions that reduces the transaction sizes by ~25% ( https://github.com/Blockstream/bitcoinsatellite/blob/master/src/compressor.cpp ). It could be deployed for local blockfiles at the cost of adding a to the places where the block is serialized/deserialized on disk and would reduce node storage for the blockchain currently by about 74GB in exchange for making operations which read blocks more cpu intense. I'd personally have one more unpruned full node running at least for another year, with such a change.

(I cite that code because it's existing and not hypothetical-- it isn't even the limit of what could be done, but just a concrete example that requires very little change to the surrounding codebase, and could also be extended to reduce P2P bandwidth by a similar amount (at least after erlay gets rid of most of the rumouring overhead)).

If you have to make a round trip request to the RPC to get the location is actually fetching the block itself over the RPC that much worse? I would assume most of the cost is in the request roundtrip and for similar cost you could request the block and index it and maintain an index from address (or whatever) to block number/hash.

Kixunil commented at 11:12 AM on January 5, 2021: none

@gmaxwell

Isn't versioning sufficient to help with the API changing? Bitcoind changed its API in breaking way many times and it seems nobody cared. There's no need to guarantee it will not change, just bump a version number if it does.

If you have to make a round trip request to the RPC to get the location is actually fetching the block itself over the RPC that much worse?

My understanding is that script hash -> block location mappings are stored in electrs db, so it optimizes the subsequent queries of transactions - no RPC roundtrip, just read it from disk.

gmaxwell commented at 3:16 PM on January 5, 2021: contributor

@Kixunil Let's imagine that we knew the next major version would add block compression (it certainly could, it's been written after all). If that happened then every user of this interface would be broken by the change, and couldn't move forward without either radically changing what it was doing or adopting complex core specific code to decode the blocks. So then compatibility becomes an impediment to implementing compression and improving it in subsequent versions.

If it were decided that it was okay to adopt this interface with no promise of compatibility in the future-- with an answer of "tough luck" should other improvements make it incompatible, then what is the need to upstream a patch to add this interface? Anyone who wanted it could apply it locally. It would be an unsupported interface, but I think any interface where there isn't an intention to preserve it is just that-- an unsupported interface.

Bitcoind changed its API in breaking way many times and it seems nobody cared.

Many people have cared very much, and created ecosystem problems by delaying/avoiding upgrading because of a compatibility breaking change. And this is with an API that mostly only exposed things whose functionality could be preserved if only with some minor renaming.

There is a big difference between a break where you have to change from reading in one place to reading in two others -- just a few minutes work and a minor change to the caller -- compared to a change where the equivalent functionality is just completely gone without any ready alternative. I could cite an example where bitcoind interface changed in a way where the functionality was just gone with no ready drop in alternative, but that has been extremely rare (and the only example I can think of off the top of my head is where the removed functionality was basically broken).

craigraw commented at 8:22 AM on January 6, 2021: none

Speaking as an external wallet developer, I hope to provide some context that may be useful. One of the original goals of my project was to provide an application that took full advantage of a local copy of the blockchain, providing enough functionality to convince people to run their own node, and thus increasing the number of nodes and improving the bitcoin network generally. To name some of these features: Quick loading of any wallet, blockchain exploration, transaction history analysis, and the privacy aspect I mentioned above. As we see from the timing results, some of these are not practical without access to the full ledger, with the addition of an address-based index.

The reality is, however, that it is quite resource intensive to maintain such an index - for example running ElectrumX requires an SSD and long initial processing, resulting in a 60GB mainnet index replicating much blockchain data, and requiring further processing to maintain. Few users have the capacity for such an undertaking. On the other hand, this proposal reduces required disk space, but more importantly indexing time, making it much more accessible particularly to pre-built node offerings.

Therefore, the tradeoff to maintaining conservative and restricted access to confirmed blocks may be fewer full nodes, as projects which attempt to fully leverage a local ledger fail to get critical mass due to the high initial bar to gain relevant access to it.

couldn't move forward without ... adopting complex core specific code to decode the blocks

IMO, there would certainly be considerable motivation to port such code - the value of direct access to the ledger is substantial.

sipa commented at 8:58 AM on January 6, 2021: member

@craigraw That's an interesting perspective I hadn't considered: that having access to an indexed blockchain means address/key/script information can be offline more of the time.

I'm not sure there is much that can be done about that. I consider anything that needs fast access to the full blockchain (even absent any indexing) as inherently too resource intensive and unscalable already. It's not realistic to assume end users will want to dedicate the resources to operate that, and thus any infrastructure relying on that will inevitably end up with trusted third parties. Whether a copy of the chain is involved in doing so seems like a small (literally, 2) constant factor that at best kicks the can down the road for a few years. It doesn't even need to be a factor two; you can run Bitcoin Core in pruning mode, and an indexing service (which copies the full blocks into its database, in whatever form is most convenient to them) that does not prune on top.

IMHO the most scalable approach is simply having local wallets (online, but possibly offline most of the time) that scan each block once (in full, or using filters) and remember transactions that matter to them. That doesn't need fast access (just streaming), and doesn't need indexing.
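As a rough model of the "scan each block once and remember what matters" approach: the block shape used here, (height, [(txid, [script, ...]), ...]), is a stand-in chosen for the sketch and not a Bitcoin Core API, and the function name is made up.

```python
def scan_new_blocks(blocks, watched: set, last_height: int, found: list) -> int:
    """Visit each block above `last_height` once and keep matching txids.

    Returns the new last scanned height, so the next session can resume
    from there without revisiting older blocks.
    """
    for height, txs in blocks:
        if height <= last_height:
            continue                      # already scanned in an earlier session
        for txid, scripts in txs:
            if watched.intersection(scripts):
                found.append((height, txid))
        last_height = height
    return last_height
```

Only sequential access to new blocks is needed here; the price is that a script added to `watched` later has no history until the older blocks are scanned again.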

There are of course use cases for full indexing; debugging, certain large services, ... and we should enable them the best we can. But it feels hacky, and hard to maintain a stable interface, to try to do that using Bitcoin Core's own storage, which was never designed for this purpose, and will going forward have different priorities.

romanz commented at 7:48 AM on January 7, 2021: contributor

Many thanks for the comments and the feedback!

But it feels hacky, and hard to maintain a stable interface, to try to do that using Bitcoin Core's own storage, which was never designed for this purpose, and will going forward have different priorities.

I agree that the current implementation is indeed "hacky" - will try to come up with a better approach.

Kixunil commented at 3:41 PM on January 7, 2021: none

It's not realistic to assume end users will want to dedicate the resources to operate that

I'm not sure. People have the option to pirate digital content for free, yet many choose to spend money instead. The best theory explaining it so far is convenience. If this applies more widely, if having the chain indexed has significantly better UX, then I expect at least some people to pay for it.

IMHO the most scalable approach is simply having local wallets (online, but possibly offline most of the time) that scan each block once (in full, or using filters) and remember transactions that matter to them. That doesn't need fast access (just streaming), and doesn't need indexing.

Frankly, EPS tried this and it led to UX so horrible that I'm highly surprised anyone is using it at all. Perhaps there's a different way to do it that would be saner but I'm not convinced there's an obvious solution.

Anyway, since @romanz agreed to look into other solutions, I'm not going to push this further. Just wanted to give an insight from my experience.

romanz closed this on Jan 12, 2021

luke-jr referenced this in commit e498d3c059 on Jan 28, 2021

DrahtBot locked this on Aug 16, 2022