COutPoint
for the output which was spent/created, and the CDiskTxPos
for the transaction in which this happened.
COutPoint
for the output which was spent/created, and the CDiskTxPos
for the transaction in which this happened.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
95@@ -93,6 +96,33 @@ class BaseIndex : public CValidationInterface
96
97 /// Stops the instance from staying in sync with blockchain updates.
98 void Stop();
99+
100+ bool IsInSyncWithMainChain() const;
101+};
102+
103+struct CDiskTxPos : public CDiskBlockPos
In commit “Introduce address index” (3c7cc3c705b08828dc8dc919e53343df78568ebd)
Note: this class is moved from src/index/txindex.cpp
with no changes (except whitespace)
src/index/disktxpos.{h,cpp}
.
18+
19+constexpr char DB_ADDRINDEX = 'a';
20+std::unique_ptr<AddrIndex> g_addrindex;
21+
22+/**
23+ * Access to the addrindex database (indexes/addrindex/)
In commit “Introduce address index” (3c7cc3c705b08828dc8dc919e53343df78568ebd)
Note: new index/addrindex.cpp
, index/addrindex.h
, and test/addrindex_tests.cpp
files in this commit mirror existing index/txindex.cpp
and index/txindex.h
, test/txindex_tests.cpp
files and have some code and comments in common. It can help to diff the addr
files against the tx
files when reviewing this PR.
38+ bool ReadAddrIndex(const uint64_t addr_id,
39+ std::vector<std::pair<std::pair<char, uint64_t>, CDiskTxPos>> &keys_found,
40+ const bool filter_by_value = false,
41+ const uint64_t value_wanted = 0);
42+
43+ bool WriteToIndex(const std::vector<std::pair<uint64_t, CDiskTxPos>> &positions, const uint256 block_hash);
In commit “Introduce address index” (3c7cc3c705b08828dc8dc919e53343df78568ebd)
I think this code could benefit from some typedefs and more consistent use of types internally to improve readability. Maybe:
0using AddressId = uint64_t;
1using BlockId = uint64;
2using DbKey = std::pair<std::pair<char, AddressID>, CDiskTxPos>;
3using DbValue = BlockId;
I also wonder if std::tuple<char, AddressId, CDiskTxPos>
might be a more natural key format than the nested pair.
50+{}
51+
52+BaseIndex::DB& AddrIndex::GetDB() const { return *m_db; }
53+
54+bool AddrIndex::DB::ReadAddrIndex(const uint64_t addr_id,
55+ std::vector<std::pair<std::pair<char, uint64_t>, CDiskTxPos>> &keys_found,
In commit “Introduce address index” (3c7cc3c705b08828dc8dc919e53343df78568ebd)
Could just make this a return value instead of an output parameter. Existing return value seems redundant.
52+BaseIndex::DB& AddrIndex::GetDB() const { return *m_db; }
53+
54+bool AddrIndex::DB::ReadAddrIndex(const uint64_t addr_id,
55+ std::vector<std::pair<std::pair<char, uint64_t>, CDiskTxPos>> &keys_found,
56+ const bool filter_by_value,
57+ const uint64_t value_wanted){
In commit “Introduce address index” (3c7cc3c705b08828dc8dc919e53343df78568ebd)
Could consolidate last two parameters into single boost::optional<AddressId> filter_value
param.
119+ if (!g_txindex->FindTx(tx_in.prevout.hash, block_hash, tx)) {
120+ // Both addrindex and txindex may be syncing in parallel, and addrindex might
121+ // be ahead of txindex. We let txindex sync first so that addrindex can continue
122+ // after it.
123+ while (!g_txindex->IsInSyncWithMainChain()) {
124+ MilliSleep(1000); //TODO: find a less arbitrary sleep time.
In commit “Introduce address index” (3c7cc3c705b08828dc8dc919e53343df78568ebd)
This sleep + BlockUntilSyncedToCurrentChain call is pretty hacky but probably ok in principle. I think @jimpo could easily tell you how to improve it. I would move this code out of the AddrIndex
class and into a g_txindex->WaitForSync()
or similar method, and probably replace the sleep with a condition variable wait.
Or maybe this code will be unnecessary with #14035 undo data?
61+
62+ iter->Seek(key_prefix);
63+ while (iter->Valid()) {
64+ std::pair<std::pair<char, uint64_t>, CDiskTxPos> key;
65+ uint64_t value;
66+ if (!iter->GetKey(key) || !iter->GetValue(value) || key.first != key_prefix) break;
In commit “Introduce address index” (3c7cc3c705b08828dc8dc919e53343df78568ebd)
Should probably reorder this condition. It seems pointless to retrieve the value when the key is wrong, and maybe dangerous if the value is a different format and deserializing could throw an exception.
203@@ -203,6 +204,123 @@ static UniValue getrawtransaction(const JSONRPCRequest& request)
204 return result;
205 }
206
207+static UniValue searchrawtransactions(const JSONRPCRequest& request) {
In commit “Add searchrawtransactions RPC” (226eeea9736127269058fbbf7816c7100f90974d)
Should add a basic python functional test for the new RPC method.
Reviewed most of the code, but just skimmed tests. It looks to me like this PR could be merged basically in its current form, so I’m curious if you’re intending to make the improvements cited in the PR description above here or in a separate PRs.
I left a few minor suggestions about the code, which you should feel free to ignore. The only changes I would definitely like to see here are:
test/functional/
to call the new rpc methoddoc/release notes.md
to describe the feature and maybe mention some use-cases1624@@ -1611,6 +1625,13 @@ bool AppInitMain()
1625 g_txindex->Start();
1626 }
1627
1628+ if (gArgs.GetBoolArg("-addrindex", DEFAULT_ADDRINDEX)) {
1629+ if (!g_txindex)
1630+ InitWarning(_("-txindex must be enabled for -addrindex to index spends from addresses."));
addrindex=
I assume they know why. Also prevents a long debate :-)
211+ "\nReturns raw transactions that contain the given address and the hash of the block(s) they were found in.\n"
212+ "\nRequires -addrindex to be enabled.\n"
213+ "\nArguments:\n"
214+ "1. \"address\" (string, required) The address to search for\n"
215+ "2. \"verbose\" (bool, optional, default = false) If set to false, only returns data for hex-encoded `txid`s. \n"
216+ "3. \"skip\" (numeric, optional, default = 0) If set, the result skips this number of initial values. \n"
skip
and count
probably make sense on an options object instead.
212+ "\nRequires -addrindex to be enabled.\n"
213+ "\nArguments:\n"
214+ "1. \"address\" (string, required) The address to search for\n"
215+ "2. \"verbose\" (bool, optional, default = false) If set to false, only returns data for hex-encoded `txid`s. \n"
216+ "3. \"skip\" (numeric, optional, default = 0) If set, the result skips this number of initial values. \n"
217+ "3. \"count\" (numeric, optional, default = 100) If set, the result will only contain this amount of values. \n"
109+ for (const auto& tx : block.vtx) {
110+ for (const auto tx_out : tx->vout){
111+ positions.emplace_back(GetAddrID(tx_out.scriptPubKey), pos);
112+ }
113+
114+ if (g_txindex && !tx->IsCoinBase()) {
This basically shouldn’t be here. The address used to receive a coin has zero relationship with the transaction spending that coin.
If someone really cares about getting it anyway, there should be a separate input txid/index -> spend txid index.
Spends from an address should at least get indexed under a separate key prefix even if it’s done by the same indexer.
Using the undo files is a much better way to get this information than the txindex.
127+ // It's also possible we can't find the tx in txindex because it fell behind in
128+ // the ValidationInterface queue. In this case we also let it finish before continuing.
129+ g_txindex->BlockUntilSyncedToCurrentChain();
130+
131+ // If we still can't find the tx then a re-org may have happened.
132+ if (!g_txindex->FindTx(tx_in.prevout.hash, block_hash, tx)) return false;
124+ MilliSleep(1000); //TODO: find a less arbitrary sleep time.
125+ }
126+
127+ // It's also possible we can't find the tx in txindex because it fell behind in
128+ // the ValidationInterface queue. In this case we also let it finish before continuing.
129+ g_txindex->BlockUntilSyncedToCurrentChain();
BlockUntilSyncedToCurrentChain
returns false immediately if m_synced
is false.
The loop waiting for IsInSyncWithMainChain
waits for txindex to finish in ThreadSync
. Once txindex finishes in ThreadSync
it will be updated by ValiditionInterface callbacks from that point on.
This line (since txindex must have m_synced
= true now) will actually block now until txindex syncs with the ValidationInterface queue.
BlockUntilSyncedToCurrentChain
becomes true.
20+std::unique_ptr<AddrIndex> g_addrindex;
21+
22+/**
23+ * Access to the addrindex database (indexes/addrindex/)
24+ *
25+ * The database stores a block locator of the chain the database is synced to
commit: Introduce address index
Can you move this comment to BaseIndex::DB
and delete from TxIndex::DB
as well? No need to copy it to every new index file, especially since it’s handled at the base layer.
153+ batch.Write(std::make_pair(std::make_pair(DB_ADDRINDEX, pos.first), pos.second), block_hash.GetUint64(0));
154+ }
155+ return WriteBatch(batch);
156+}
157+
158+void AddrIndex::BlockDisconnected(const std::shared_ptr<const CBlock> &block) {
if (!m_synced) return;
42+
43+ // Destructor is declared because this class contains a unique_ptr to an incomplete type.
44+ virtual ~AddrIndex() override;
45+
46+ /// Lookup transaction(s) by scriptPubKey. Fills txs vector with (block_hash, tx) pairs.
47+ bool FindTxsByScript(const CScript& dest, std::vector<std::pair<uint256, CTransactionRef>> &txs);
157+
158+void AddrIndex::BlockDisconnected(const std::shared_ptr<const CBlock> &block) {
159+ const uint64_t block_hash_bits = block->GetHash().GetUint64(0);
160+ std::unordered_set<uint64_t> addr_ids_to_remove;
161+
162+ {
277@@ -276,3 +278,8 @@ void BaseIndex::Stop()
278 m_thread_sync.join();
279 }
280 }
281+
282+
283+bool BaseIndex::IsInSyncWithMainChain() const {
61@@ -61,6 +62,8 @@ class BaseIndex : public CValidationInterface
62 void BlockConnected(const std::shared_ptr<const CBlock>& block, const CBlockIndex* pindex,
63 const std::vector<CTransactionRef>& txn_conflicted) override;
64
65+ void BlockDisconnected(const std::shared_ptr<const CBlock> &block) override;
45+
46+ /// Lookup transaction(s) by scriptPubKey. Fills txs vector with (block_hash, tx) pairs.
47+ bool FindTxsByScript(const CScript& dest, std::vector<std::pair<uint256, CTransactionRef>> &txs);
48+
49+ // Returns part of key used to store information in db.
50+ static uint64_t GetAddrID(const CScript& script);
51+
52+BaseIndex::DB& AddrIndex::GetDB() const { return *m_db; }
53+
54+bool AddrIndex::DB::ReadAddrIndex(const uint64_t addr_id,
55+ std::vector<std::pair<std::pair<char, uint64_t>, CDiskTxPos>> &keys_found,
56+ const bool filter_by_value,
filter_by_value
functionality isn’t used? Also, I’d rename to something like filter_by_block
because it’s not clear what value means and is tightly coupled to the database layout chosen.
Nice work! I’m glad someone’s working on this. Concept ACK.
Thanks for all the reviews.
To answer some of @jimpo’s questions:
2 & 4. I included part of the block hash so that in BlockDisconnected we ae sure to remove the entries in the index from this block only (that’s where filter_by_value
is used). The reason I chose to remove entries from the database is to prevent reading into a block file using an old CDiskTxPos
that may no longer be a valid position. Otherwise in FindTxsByScript
you could run into errors. You’re right that this problem would be better handled by higher level methods.
I think that returning just the outpoint is a better idea than the current choice so I’ll switch to that and try to incorporate all the other feedback here.
63+ while (iter->Valid()) {
64+ std::pair<std::pair<char, uint64_t>, CDiskTxPos> key;
65+ uint64_t value;
66+ if (!iter->GetKey(key) || !iter->GetValue(value) || key.first != key_prefix) break;
67+
68+ if (!filter_by_value || (filter_by_value && value == value_wanted)) {
filter_by_value
: what about !filter_by_value || value == value_wanted
instead?
141+ }
142+
143+ return m_db->WriteToIndex(positions, block.GetHash());
144+}
145+
146+bool AddrIndex::DB::WriteToIndex(const std::vector<std::pair<uint64_t, CDiskTxPos>>& positions, const uint256 block_hash)
block_hash
should be passed by const reference?
174@@ -173,6 +175,76 @@ TestChain100Setup::CreateAndProcessBlock(const std::vector<CMutableTransaction>&
175 return result;
176 }
177
178+// Based off of BuildChain in validation_block_tests.cpp
179+
180+// Build a chain of blocks that contains all of the transactions in txns.
181+void TestChain100Setup::BuildChain(const uint256 prev_hash, const uint32_t prev_time, int height, std::vector<CMutableTransaction> &txns, const CScript& scriptPubKey, std::vector<std::shared_ptr<const CBlock>>& blocks) {
prev_hash
should be passed by const reference?
88@@ -89,6 +89,15 @@ struct TestChain100Setup : public TestingSetup {
89 CBlock CreateAndProcessBlock(const std::vector<CMutableTransaction>& txns,
90 const CScript& scriptPubKey);
91
92+ void BuildChain(const uint256 prev_hash,
prev_hash
should be passed by const reference?
223+ blocks.push_back(shared_pblock);
224+
225+ BuildChain(blocks.back()->GetHash(), blocks.back()->nTime, height - 1, txns, scriptPubKey, blocks);
226+}
227+
228+void TestChain100Setup::CreateSpendingTxs(int coinbase_spent_offset, std::vector<CScript>& script_pub_keys, std::vector<CMutableTransaction> &spends, CScript coinbase_script_pub_key) {
coinbase_script_pub_key
should be passed by const reference?
307+ }
308+ return ret;
309+ }
310+
311+ std::vector<std::pair<uint256, CTransactionRef>>::const_iterator it = result.begin();
312+ while (it != result.end() && skip--) it++; // Skip first set of results as needed.
++it
319+ std::string hex_tx = EncodeHexTx(*(tuple.second), RPCSerializationFlags());
320+ tx_val.pushKV("hex", hex_tx);
321+ tx_val.pushKV("blockhash", tuple.first.GetHex());
322+ }
323+ ret.push_back(tx_val);
324+ it++;
++it
189+ block.vtx.resize(1);
190+
191+ // If this is the last block, add all remaining transactions.
192+ // Otherwise add with some randomness.
193+ for (auto it = txns.begin(); it != txns.end();) {
194+ bool add_tx = (height == 1) || (GetRandInt(height) < txns.size());
0test/test_bitcoin.cpp:194:60: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
1 bool add_tx = (height == 1) || (GetRandInt(height) < txns.size());
2 ~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~
105+ std::vector<std::pair<uint64_t, CDiskTxPos>> positions;
106+ positions.reserve(2 * block.vtx.size()); // Most transactions have at least 1 input and 1 output.
107+
108+ // Index addresses of spent outputs if txindex is enabled,
109+ for (const auto& tx : block.vtx) {
110+ for (const auto tx_out : tx->vout){
0index/addrindex.cpp:110:25: warning: loop variable 'tx_out' of type 'const CTxOut' creates a copy from type 'const CTxOut' [-Wrange-loop-analysis]
1 for (const auto tx_out : tx->vout){
2 ^
3index/addrindex.cpp:110:14: note: use reference type 'const CTxOut &' to prevent copying
4 for (const auto tx_out : tx->vout){
5 ^~~~~~~~~~~~~~~~~~~
110+ for (const auto tx_out : tx->vout){
111+ positions.emplace_back(GetAddrID(tx_out.scriptPubKey), pos);
112+ }
113+
114+ if (g_txindex && !tx->IsCoinBase()) {
115+ for (const auto tx_in : tx->vin) {
163+ LOCK(cs_main);
164+ CCoinsViewCache view(pcoinsTip.get());
165+
166+ // Collect all addr_ids from txs in this block.
167+ for (const auto& tx : block->vtx) {
168+ for (const auto tx_out : tx->vout){
168+ for (const auto tx_out : tx->vout){
169+ addr_ids_to_remove.emplace(GetAddrID(tx_out.scriptPubKey));
170+ }
171+
172+ if (!tx->IsCoinBase()) {
173+ for (const auto tx_in : tx->vin){
107+
108+ // Mine blocks for coinbase maturity, so we can spend some coinbase outputs in the test.
109+ CScript coinbase_script_pub_key = CScript() << ToByteVector(coinbaseKey.GetPubKey()) << OP_CHECKSIG;
110+ for (int i = 0; i < 20; i++) {
111+ std::vector<CMutableTransaction> no_txns;
112+ const CBlock& block = CreateAndProcessBlock(no_txns, coinbase_script_pub_key);
0test/addrindex_tests.cpp:112:23: warning: unused variable 'block' [-Wunused-variable]
1 const CBlock& block = CreateAndProcessBlock(no_txns, coinbase_script_pub_key);
2 ^
301+
302+ UniValue ret(UniValue::VARR);
303+ std::vector<std::pair<uint256, CTransactionRef>> result;
304+ if (!g_addrindex->FindTxsByScript(scriptPubKey, result)) {
305+ if (!addrindex_ready) {
306+ throw JSONRPCError(RPC_INVALID_ADDRESS_OR_KEY,"Transactions with given address not found. Blockchain transactions are still in the process of being indexed");
Blockchain transactions are still in the process of being indexed
warning should always be shown, because there might be missing transactions while indexing is in progress (txindex
doesn’t have that problem).
Concept ACK. Suggest adding -addrindex
and searchrawtransactions <address>
to the PR description for easier review.
I would do “integrating output descriptors” in a different PR.
Lightly tested on macOS. It doesn’t find the coinbase transaction to mtnNvZY7iQCfzJCVon13tfJSVYDn9iiUWD
in block 120, mg4bQva8w7Cjs8KaKHeEbyXUaCzoQAcXyH
in block 10001 and several others.
39+public:
40+ /// Constructs the index, which becomes available to be queried.
41+ explicit AddrIndex(size_t n_cache_size, bool f_memory = false, bool f_wipe = false);
42+
43+ // Destructor is declared because this class contains a unique_ptr to an incomplete type.
44+ virtual ~AddrIndex() override;
02018-09-22 21:20:09 cpplint(pr=14053): src/index/addrindex.h:44: "virtual" is redundant since function is already declared as "override" [readability/inheritance] [4]
Comments to the concept
Some time ago I have rebased bitpay’s bitcore patches (from their insight block explorer) to bitcoin core here (without much work on my own)
The consensus back then seemed to be that more indexes in the core is not a good thing, and work should be instead done on external indexes. However I am no sure how this PR differs, I just quickly skimmed it.
68+ BOOST_CHECK(addr_index.BlockUntilSyncedToCurrentChain()); // Let the address index catch up.
69+ BOOST_CHECK(chainActive.Tip()->GetBlockHash() == block_hash); // Sanity check to make sure this block is actually being used.
70+
71+ // Now check that all the addresses we sent to are present in the index.
72+ for (int i = 0; i < 10; i++) {
73+ std::vector<std::pair<COutPoint, std::pair<CTransactionRef, uint256>>> spends;
spends
shadows a variable existing in the outer scope.
69+ BOOST_CHECK(chainActive.Tip()->GetBlockHash() == block_hash); // Sanity check to make sure this block is actually being used.
70+
71+ // Now check that all the addresses we sent to are present in the index.
72+ for (int i = 0; i < 10; i++) {
73+ std::vector<std::pair<COutPoint, std::pair<CTransactionRef, uint256>>> spends;
74+ std::vector<std::pair<COutPoint, std::pair<CTransactionRef, uint256>>> creations;
creations
.
Lightly tested on macOS. It doesn’t find the coinbase transaction to
mtnNvZY7iQCfzJCVon13tfJSVYDn9iiUWD
in block 120,mg4bQva8w7Cjs8KaKHeEbyXUaCzoQAcXyH
in block 10001 and several others.
I’m confused at what you’re referring to bymtn...
and mg4...
. The outputs for those two coinbase transactions are P2PK, so the only way to search for them is by scriptPubKey. I used the hex value of the scriptPubKey from here and was able to find it.
Maybe tag this needing concept ack.
This would probably be pretty easy to rebase, and has a previous concept ack from jimpo, but it’s not clear if other people want this change or might object to it.
I for one would be very happy to see an optional addr index + api finally available in bitcoin-core, and not as a separate patch or hacky thing. @marcinja is there an API in your PR that accepts multiple addresses as input and returns total received/spent for each? Some block explorers have this functionality and it is great because much more efficient than calling API for each addr individually.
Also I think it should be caller’s option (via param) whether to return list of tx or not. Sometimes only addr summary info is needed eg total received/spent and tx info just adds bloat and slows things down in such cases.
I haven’t looked at how this is implemented, at all– but the goal is one that is extraordinarily useful and addresses a severe and growing usability probable in the software that has burned me many times.
Sorry for the dumb question, but what’s the severe usability problem?
“usually just the Wrong way”
This sounds like an ivory tower attitude to me. So tell us wise ones, if an addrindex is the wrong way, then what exactly is the right way? Because thus far bitcoin-core’s answer seems to be: don’t lookup transactions related to an address AT ALL unless they are in your own wallet, which is useless for anyone trying to provide address lookup as a service. They then must either run a patched bitcoind or turn to btcd or some other 3rd party solution.
what exactly is the right way?
The right way to do what? Can you provide more detail? I think it’d be good to keep the discussion as concrete as possible and discuss specific problems and use cases.
ok, here are two tools I alone have written that need something like searchrawtransactions().
hd-wallet-addrs - a tool for discovering all hd-wallet addresses that have ever been used. So it derives [receive,change] and checks if each address has ever received funds, up to gap limit. Presently it must rely on 3rd party block explorers APIs.
bitprices A tool for querying all transactions for all used addresses in a wallet (or any group of addresses) and generating reports including historical prices on the date of each transaction. Useful for audits and finding gain/loss, etc after the fact. Presently btcd has the best/fastest implementation, but only after I submitted patches for searchrawtransactions.
Here is a website frontend.
Note that the above tools work with any arbitrary seeds or addresses and are expected to return results quickly, so creating a watch-only wallet for each request is not a good option, and anyway it would then just be discarded immediately, so why bother?
The first two links above actually link to discussions of the various API providers and their various strengths/weaknesses for these purposes.
Others have their own reasons. afaict, most block explorers or bitcoin API providers need some sort of address index and are rolling their own somehow.
Thanks. From what I can gather the tools described in #14053 (comment) are accounting / auditing tools useful for someone running the bitcoin wallet, and this PR would let people who want to use these tools run them locally instead of relying on centralized third party services.
It seems like a good argument in favor of this feature.
I’m not sure what the better arguments against this feature are, so it would be helpful if someone wanted to articulate them. One argument might be that we don’t want people getting used to using tools like these, because eventually in the future historical indexes will be less able to scale, and people who are using these tools in their workflow will be forced to turn to centralized third party services.
Concept ACK. The main argument against adding an address index to Bitcoin Core is that people/services will come to rely on them, and they’re not scalable long term. My response to that is that people are already using address indexes and block explorers. Providing/not providing that functionality in Bitcoin Core isn’t going to change that.
The new indexing infrastructure means that this can be done in a very unobtrusive way. The index has no direct interaction at all with validation or networking. Instead it hooks into the validation interface and runs in a background thread, and so should have no performance impact when the node is sync’ed (I expect this might slow down IBD as ActivateBestChain()
gets blocked on LimitValidationInterfaceQueue()
, but presumably users who are indexing the entire transaction history are prepared to wait a bit longer).
Concept ACK.
I know I’ve flip-flopped on this feature historically, but I agree very much with the points that @jnewbery makes. Trading the maintenance burden of ~800 lines of relatively non-consensus-risky indexing code for some very practical gains in user privacy is compelling to me. Even with the caveat that we don’t think some of these uses are architecturally sound for years and years.
One might make the argument that there’s a slippery slope here - if we consent to this index now, would demand for yet another index come along, and the next and so on? I don’t think so. In the few years I’ve followed this, there has been reliable demand for an address index (and numerous community implementations of varying repute) and nothing else - aside of course from the compact block index - so I don’t think a slippery-slope argument holds. @marcinja do you have time to rebase this PR?
Maybe we should log a warning when the address index is enabled, like:
Could also add a coda if the built in wallet is enabled:
The idea would be to give end users relying on third party accounting tools a way of knowing about the limitations of those tools, and encouraging authors to implement more scalable approaches.
134@@ -135,6 +135,8 @@ BITCOIN_CORE_H = \
135 httpserver.h \
136 index/base.h \
137 index/blockfilterindex.h \
138+ index/addrindex.h \
139+ index/disktxpos.h \
27+
28+private:
29+ const std::unique_ptr<DB> m_db;
30+
31+ // m_hash_seed is used by GetAddrID in its calls to MurmurHash3.
32+ // It is stored in the index, and restored from their on construction
18+
19+#include <boost/thread.hpp>
20+
21+std::unique_ptr<AddrIndex> g_addr_index;
22+
23+static constexpr char DB_ADDR_INDEX = 'a';
121+ DBKey key;
122+ DBValue value;
123+ if (!iter->GetKey(key) || key.m_addr_id != addr_id || !iter->GetValue(value) ) break;
124+
125+ // Check that the stored script matches the one we're searching for, in case of hash collisions.
126+ if (value.second != script) continue;
Does this branch need to advance the iterator? If not, I think value
will be the same the next time round and we’ll never break out of this loop.
I think the following is what we want:
0 ...
1 if (value.second == script) {
2 result.emplace_back(std::make_pair(key, value));
3 }
4
5 iter->Next();
6}
173+{
174+ CBlockUndo block_undo;
175+ CDiskTxPos pos(pindex->GetBlockPos(), GetSizeOfCompactSize(block.vtx.size()));
176+ std::vector<std::pair<DBKey, DBValue>> entries;
177+
178+ const bool not_genesis_block = (pindex->nHeight > 0);
not_thing
seems unintuitive to me. I think it’d be better to define genesis
and then test on !genesis
.
192+ // Skip coinbase inputs.
193+ if (not_genesis_block && i > 0) {
194+ const CTxUndo& tx_undo = block_undo.vtxundo[i-1];
195+ for (size_t k = 0; k < tx.vin.size(); ++k) {
196+ CScript spent_outputs_scriptpubkey = tx_undo.vprevout[k].out.scriptPubKey;
197+ DBKey key(DBKeyType::SPENT, GetAddrId(spent_outputs_scriptpubkey), tx.vin[k].prevout);
I don’t think this isn’t the information that we want to save in the address index. For a given scriptPubKey, a user wants to know:
So here, I think you want to save the txid and input index spending the coin. You’re actually saving the txid and output index that creates the coin, because you’re using the prevout.
(I’m not entirely sure about this. Perhaps you are trying to return the outpoint that created the coin in the spent outputs array, but that’s not clear to me from the RPC documentation).
I think this is correct because he is using the prevout in the key. When looking up by this script with this outpoint, one would want to find this transaction because it spends this outpoint.
AFAICT the value just contains the position of the relevant transaction on disk and the scriptPubKey (to detect collisions). The value does not contain any indexes into inputs or outputs.
222+ "\nRequires -addrindex to be enabled.\n"
223+ "\nArguments:\n",
224+ {
225+ {"address", RPCArg::Type::STR_HEX, RPCArg::Optional::NO, /* default_val */ "", "address or scriptPubKey"},
226+ {"verbose", RPCArg::Type::BOOL, RPCArg::Optional::OMITTED, /* default_val */ "false", "If false, return hex-encoded tx, otherwise return a json object"},
227+ {"count", RPCArg::Type::NUM, RPCArg::Optional::OMITTED, /* default_val */ "", "The block in which to look for the transaction"},
227+ {"count", RPCArg::Type::NUM, RPCArg::Optional::OMITTED, /* default_val */ "", "The block in which to look for the transaction"},
228+ },
229+ {
230+ RPCResult{"if verbose is not set or set to false",
231+ "\nResult:\n"
232+ " [ (array of json objects)\n"
creates
and spends
I’ve done a preliminary review and left a few comments.
I think this PR could use:
addrindex.cpp
or addrindex.h
.rpc_rawtransaction.py
.213@@ -213,6 +214,142 @@ static UniValue getrawtransaction(const JSONRPCRequest& request)
214 return result;
215 }
216
217+static UniValue searchrawtransactions(const JSONRPCRequest& request) {
218+ if (request.fHelp || request.params.size() < 1 || request.params.size() > 4)
0 if (request.fHelp || request.params.size() < 1 || request.params.size() > 3)
.Check()
syntax instead
215+
216+// FindTxsByScript fills the spends_result vector with outpoints corresponding
217+// to the output spent with the given script, and the transaction it was spent
218+// in. creations_result is filled with outpoints for outputs created with this
219+// script as their script pubkey, and the transactions they were created in.
220+bool AddrIndex::FindTxsByScript(const CScript& script,
321+ std::string hex_tx = EncodeHexTx(*(spend.second.first), RPCSerializationFlags());
322+ tx_val.pushKV("hex", hex_tx);
323+ }
324+ spends.push_back(tx_val);
325+ ++spends_it;
326+ }
count
then no creates will be returned. Ideally the spends and creates will be returned in order, rather than all spends first.
213@@ -213,6 +214,142 @@ static UniValue getrawtransaction(const JSONRPCRequest& request)
214 return result;
215 }
216
217+static UniValue searchrawtransactions(const JSONRPCRequest& request) {
118+ std::unique_ptr<CDBIterator> iter(NewIterator());
119+ iter->Seek(search_key);
120+ while (iter->Valid()) {
121+ DBKey key;
122+ DBValue value;
123+ if (!iter->GetKey(key) || key.m_addr_id != addr_id || !iter->GetValue(value) ) break;
blockfilterindex.cpp
.
485@@ -487,6 +486,44 @@ def run_test(self):
486 assert_equal(testres['allowed'], True)
487 self.nodes[2].sendrawtransaction(hexstring=rawTxSigned['hex'], maxfeerate='0.20000000')
488
489+ self._test_searchrawtransactions()
490+
491+ def _test_searchrawtransactions(self):
309+ while (spends_it != spends_result.end() && count--) {
310+ const auto& spend = *spends_it;
311+ UniValue tx_val(UniValue::VOBJ);
312+
313+ UniValue outpoint_val(UniValue::VOBJ);
314+ outpoint_val.pushKV("txid", spend.first.hash.GetHex());
0 outpoint_val.pushKV("txid", spend.second.first.hash.GetHex());
I’ll write a functional test that catches this too :)
246+ std::pair<CTransactionRef, uint256> result = std::make_pair(tx, header.GetHash());
247+
248+ // Place entry into correct vector depending on its type.
249+ switch (key.m_key_type) {
250+ case DBKeyType::SPENT:
251+ spends_result.emplace_back(std::make_pair(key.m_outpoint, result));
As pointed out by jnewbery in review club, I think this is wrong.
The DB’s key always represents the outpoint that created the scriptPubKey. For spends_result, you want the txid of the transaction that spends the scriptPubKey. So instead of the DB key’s outpoint as the transaction id to be returned to the user, you want the transaction id of the transaction in the DB value. Or, in the RPC code lines 314 and 315, use the txid from spend.second
.
Edit: Or maybe this field of the returned result is always the created_outpoint, even though it’s called the spent_outpoint sometimes? Would be good to clarify this.
Ran on i7-8750H, took about 48 hours on an already synced node. Took 200 GB disk space.
02020-01-08T02:22:01Z addr_index thread start
1...
22020-01-10T02:37:07Z addr_index is enabled at height 611001
32020-01-10T02:37:07Z addr_index thread exit
0$ du -h addr_index
1200G addr_index
I think @jnewbery was hinting at this in the PR Review Club as well: it seems like this index does not handle reorgs. BaseIndex handles this through Rewind which is not overridden.
Handling reorgs might actually be undesirable for this… but changing CDiskTxPos like I just suggested to a block height could be incompatible with saving references to blocks not in the main chain too. (Maybe storing height OR blockhash would work?)
Any update when this could be merged?
I kind of thought that this was still in conceptual review but it has received 5 Concept ACKs from experienced Core developers and I did not see much of the push back this has received historically, so I tend to believe this can be moved out of Conceptual review phase and move forward. There is still a lot of work to be done in terms of testing, documentation etc. @marcinja If you are currently unable to work on this I would be happy to take this over since I have some experience with indexes already :)
Past NACKs don’t disappear just because a new PR is opened…
Definitely agree, especially if the PRs are very close. But I think if a new PR is opened with a different approach it can still be expected for the reviewers who gave NACKs in the past, to participate in the discussion again and reinstate their NACK or someone else who agrees with their view should do it for them. Otherwise, we might never be able to move forward if a concept is NACKed and then the reviewer leaves the project or just does not comment on the new approach. But it would probably be good if @marcinja would address the past NACKs directly and give those reviewers another chance to say if they have changed their minds or not.
Any update when this could be merged? @baso10
This PR needs to be rebased, as you can see in #14053 (comment)
The PR is still going through peer review – see the discussion above
See https://github.com/bitcoin/bitcoin/blob/master/CONTRIBUTING.md#peer-review for more info about the peer review process in this project
Hey everyone. Just wanted to give a small update here. I am still planning on keeping this PR up-to-date and rebased. I’ve realized it’s more annoying to rebase after procrastinating, and also it’s not fair to the reviewers and other people interested so I will be much more diligent about rebasing now.
The total size of the address index right now is 223GB. I don’t have a good answer on index sync time because I was re-syncing on a laptop. This was done on the 8f3fc6b4b8fb1a2db3035398d3dd5a2ed91cc3f9 branch over the course of a few days. I’m going to time an IBD and addrindex sync on a desktop with a good SSD so as to get meaningful results.
The good news is that searchrawtransactions
is not prohibitively slow.
0time ./src/bitcoin-cli searchrawtransactions 3EiAcrzq1cELXScc98KeCswGWZaPGceT1d true 30000 > big-search
1
2./src/bitcoin-cli searchrawtransactions 3EiAcrzq1cELXScc98KeCswGWZaPGceT1d 4.79s user 1.95s system 49% cpu 13.575 total
(the file big-search
created here is about 366MB)
I have also confirmed that the address index has the same number of created/spent outpoints as on this explorer: https://blockstream.info/address/3EiAcrzq1cELXScc98KeCswGWZaPGceT1d
36+ * key_type is SPENT when the outpoint stored in the key is spent, i.e. it is
37+ * used as a prevout in a transaction input. It is CREATED when the outpoint is
38+ * created as a new COutpoint in that transaction.
39+ *
40+ * outpoints are stored in the key as opposed to in the DB value to preserve
41+ * uniqueness and to support multiple values for a a single addr_id and key_type
0 * uniqueness and to support multiple values for a single addr_id and key_type
25+ * stores one unique global value under the DBKeyType::SEED key that seeds the
26+ * MurmurHash3 hasher used to create AddrIds.
27+ *
28+ * The DB keys are structured as follows: <addr_id, key_type, outpoint>
29+ *
30+ * addr_id is the hash of the script_pub_key computed using MurmurHash3, a a
0 * addr_id is the hash of the script_pub_key computed using MurmurHash3, a
Adds index that relates scriptPubKeys to location of transactions in the
filesystem, along with the hash of the block they are found in, and the
outpoint information of the txout with the related script.
Setup address index in initialization process.
Add initialization warning and wallet feature request warning as
suggested by ryanofsky.
Adds searchrawtransactions RPC that uses the address index to lookup
transactions and outpoints by script and address. Adds basic functional
tests for searchrawtransactions.
SHA256(scriptPubKey)
while this PR is using MurmurHash3(scriptPubKey)
.
Also, ElectrumX & electrs are using RocksDB for the index storage - resulting in better performance and disk usage (compared to LevelDB).
@Talkless note that while some Electrum users run their own bitcoind, many do not. Electrum wants to support both use cases, and in fact the suspicion is that most users just use a public server. When using a public server, the client cannot use bitcoind RPC, hence in that case I don’t see how the middleware (e.g. ElectrumX) could be avoided.
For the own bitcoind use case, maybe the client could have another optional mode of operation where it uses bitcoind RPC directly, which is I guess what you are asking about. For that, an address-index in bitcoind is the main thing missing indeed, however not the only one. For one, the electrum protocol (the client<->“middleware server” connection) has address subscriptions - the client gets a notification when a history of one its addresses changes. We are also planning on soon adding another method into the protocol that allows txoutpoint->spender_txid
lookups (and notifications); I guess that could be implemented using the index in this PR albeit in a very inefficient way for heavily reused addresses.
IMHO there are multiple upsides for having this middleware setup for Electrum:
Nevertheless if someone steps up and contributes patches, this kind of thing could be added.
The total size of the address index right now is 223GB
That sounds much larger than expected.
Even with the txoutpoint->spender_txid
map I mentioned above, when using LevelDB, ElectrumX uses around 90 GiB of disk space.
The DB keys are structured as follows:
<addr_id, key_type, outpoint>
The DB values are simply:<CDiskTxPos, CScript>
Instead of having both addr_id
and CScript
, why not just put a long hash, e.g. sha256(CScript)
into the key?
Also, I expect most users of this index would also want txindex enabled. You might want to consider making address index dependent on txindex. Have you investigated how much space that would save? There would be no need to store CDiskTxPos
.
Another trick that ElectrumX uses is that only the best chain is indexed. We have a tx_num->txid
map as raw files on disk. A tx_num
is the index of the transaction in the linear history of the blockchain, a 5-byte integer (so e.g. the genesis block coinbase tx has tx_num=0
). This map uses around 17 GiB atm. With this, you can encode the txid part of the outpoint
as 5 bytes instead of 32 bytes.
I’m concept -0 on this.
My primary objection is that I think it’s a bad idea for any infrastructure to be built all that relies on having fully-indexed blockchain data available (this also applies to txindex, but we can’t just remove support…).
However, it seems many people want something like this, and are going to use it anyway. The question is then whether it belongs in the bitcoin-core codebase. Alternative, and more performant presumably, like electrs exist already too, so it isn’t exactly impossible to do this elsewhere.
Still, given that we now have the indexes infrastructure, it means that things like this are easy to add in a fairly modular way without invading consensus code. So if people really want this, fine.
Overall approach comment: I don’t think MurmurHash should be used for anything new; there are strictly better hash functions available. I’d suggest SipHash if that’s fast enough.
I’m also a little more negative on having this in Core than I previously was. After working in a few industrial contexts on wallet stuff, it’s clear to me that an address index is really only required if you want to implement a block explorer or do chain analysis. For both of these applications, using something like electrs seems sufficient.
For personal wallet management, a full address index is not required. I think the origin of some confusion is that things like the Electrum Personal Server have become synonymous with this kind of usage, but in reality a full index is overkill when descriptor-specific rescans can be done once for a historical backfill and then per-block scanning can be done from there on out.
I want to point out that this is a nice implementation and good work by @marcinja, but I’m leaning slightly against the inclusion of such an index in Core at this point.
My primary objection is that I think it’s a bad idea for any infrastructure to be built that relies on having fully-indexed blockchain data available.
I agree on this.
IMO the only use cases to ever use a full address index are:
1 (instant backup recovery) could be solved with either scantxoutset
(take a minute or two) or by an address-index for the utxo set only. But both would not restore the spent history.
A scalable non-enterprise solution to restore the spent history is using blockfilters. Scan through the filters and rescan only the relevant blocks (a matter of minutes), see #20664.
2 (a backend for thousands of wallets): out of scope for this project.
3 (explore purposes): I think this is a valid use case. Though adding this PR to Bitcoin Core will lead to many many projects using it in production increasing the traffic in this project and eventually steal time from existing contributors (rebase, maintenance, drag-along)
My main fear is that people are going to use this index (a full address index) to use it as an electrum(ish) backend for a handful of wallets.
With multiwallet, watch-only-wallets, PSBT, we have all tools to server multiple wallets in a scalable way for external applications.
I also think merging this as it is, would be in contradiction to the process- and repository-separation effort.
Therefore I’m ~0 (slightly towards NACK) to add this.
If this would be in another repository (still under bitcoin/*
) and process separated, I would ACK it.
Hi all, thanks for the feedback and review. This was an enjoyable PR to work on and I learned a lot from all your comments.
I’m closing this PR because its size probably requires stronger support from contributors to get in. It also seems more clear now that all of the practical use-cases are covered by existing features and some lightweight alternatives (https://github.com/bitcoin/bitcoin/pull/20664) .
I also agree that it would be bad to incentivize using an address index to support an external electrum wallet, when it’s not the intended use-case and would cause unnecessary burden on contributors and maintainers in this project, e.g. from users of those wallets wanting new features or updates.
marcinja
DrahtBot
ryanofsky
jimpo
luke-jr
Sjors
practicalswift
karelbilek
MarcoFalke
dan-da
abunsen
gmaxwell
jnewbery
jgarzik
jamesob
promag
narula
andrewtoth
jonatack
fjahr
Talkless
baso10
fanquake
romanz
c78867886
decryp2kanon
SomberNight
sipa
jonasschnelli
Labels
UTXO Db and Indexes
Needs Conceptual Review
Review club