Add address-based index (attempt 4?)

marcinja commented at 6:51 pm on August 24, 2018: contributor

Adds index to transactions by scriptPubKey. Based off of #2802. Stores hashes of scripts (hashed using Murmurhash3, with a hash seed that is stored in the index database), along with the COutPoint for the output which was spent/created, and the CDiskTxPos for the transaction in which this happened.

DrahtBot commented at 7:23 pm on August 24, 2018: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#20404 (Remove names from translatable strings by hebasto)
#20012 (rpc: Remove duplicate name and argNames from CRPCCommand by MarcoFalke)
#19873 (Flush dbcache early if system is under memory pressure by luke-jr)
#19806 (validation: UTXO snapshot activation by jamesob)
#19521 (Coinstats Index (without UTXO set hash) by fjahr)
#15946 (Allow maintaining the blockfilterindex when using prune by jonasschnelli)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

laanwj added the label UTXO Db and Indexes on Aug 25, 2018

DrahtBot added the label Needs rebase on Aug 25, 2018

marcinja force-pushed on Aug 27, 2018

DrahtBot removed the label Needs rebase on Aug 27, 2018

marcinja force-pushed on Aug 28, 2018

in src/index/base.h:103 in 3c7cc3c705 outdated

 95@@ -93,6 +96,33 @@ class BaseIndex : public CValidationInterface
 96 
 97     /// Stops the instance from staying in sync with blockchain updates.
 98     void Stop();
 99+
100+    bool IsInSyncWithMainChain() const;
101+};
102+
103+struct CDiskTxPos : public CDiskBlockPos

ryanofsky commented at 4:47 pm on August 29, 2018:

In commit “Introduce address index” (3c7cc3c705b08828dc8dc919e53343df78568ebd)

Note: this class is moved from src/index/txindex.cpp with no changes (except whitespace)

jimpo commented at 10:53 pm on August 30, 2018:

I’d vote to move it to it’s own file, like src/index/disktxpos.{h,cpp}.

in src/index/addrindex.cpp:23 in 3c7cc3c705 outdated

18+
19+constexpr char DB_ADDRINDEX = 'a';
20+std::unique_ptr<AddrIndex> g_addrindex;
21+
22+/**
23+ * Access to the addrindex database (indexes/addrindex/)

ryanofsky commented at 4:53 pm on August 29, 2018:

In commit “Introduce address index” (3c7cc3c705b08828dc8dc919e53343df78568ebd)

Note: new index/addrindex.cpp, index/addrindex.h, and test/addrindex_tests.cpp files in this commit mirror existing index/txindex.cpp and index/txindex.h, test/txindex_tests.cpp files and have some code and comments in common. It can help to diff the addr files against the tx files when reviewing this PR.

in src/index/addrindex.cpp:43 in 3c7cc3c705 outdated

38+    bool ReadAddrIndex(const uint64_t addr_id,
39+                       std::vector<std::pair<std::pair<char, uint64_t>, CDiskTxPos>> &keys_found,
40+                       const bool filter_by_value = false,
41+                       const uint64_t value_wanted = 0);
42+
43+    bool WriteToIndex(const std::vector<std::pair<uint64_t, CDiskTxPos>> &positions, const uint256 block_hash);

ryanofsky commented at 5:37 pm on August 29, 2018:

In commit “Introduce address index” (3c7cc3c705b08828dc8dc919e53343df78568ebd)

I think this code could benefit from some typedefs and more consistent use of types internally to improve readability. Maybe:

0using AddressId = uint64_t;
1using BlockId = uint64;
2using DbKey = std::pair<std::pair<char, AddressID>, CDiskTxPos>;
3using DbValue = BlockId;

I also wonder if std::tuple<char, AddressId, CDiskTxPos> might be a more natural key format than the nested pair.

in src/index/addrindex.cpp:55 in 3c7cc3c705 outdated

50+{}
51+
52+BaseIndex::DB& AddrIndex::GetDB() const { return *m_db; }
53+
54+bool AddrIndex::DB::ReadAddrIndex(const uint64_t addr_id,
55+                                  std::vector<std::pair<std::pair<char, uint64_t>, CDiskTxPos>> &keys_found,

ryanofsky commented at 5:47 pm on August 29, 2018:

In commit “Introduce address index” (3c7cc3c705b08828dc8dc919e53343df78568ebd)

Could just make this a return value instead of an output parameter. Existing return value seems redundant.

in src/index/addrindex.cpp:57 in 3c7cc3c705 outdated

52+BaseIndex::DB& AddrIndex::GetDB() const { return *m_db; }
53+
54+bool AddrIndex::DB::ReadAddrIndex(const uint64_t addr_id,
55+                                  std::vector<std::pair<std::pair<char, uint64_t>, CDiskTxPos>> &keys_found,
56+                                  const bool filter_by_value,
57+                                  const uint64_t value_wanted){

ryanofsky commented at 5:48 pm on August 29, 2018:

In commit “Introduce address index” (3c7cc3c705b08828dc8dc919e53343df78568ebd)

Could consolidate last two parameters into single boost::optional<AddressId> filter_value param.

in src/index/addrindex.cpp:124 in 3c7cc3c705 outdated

119+                if (!g_txindex->FindTx(tx_in.prevout.hash, block_hash, tx)) {
120+                    // Both addrindex and txindex may be syncing in parallel, and addrindex might
121+                    // be ahead of txindex. We let txindex sync first so that addrindex can continue
122+                    // after it.
123+                    while (!g_txindex->IsInSyncWithMainChain()) {
124+                        MilliSleep(1000); //TODO: find a less arbitrary sleep time.

ryanofsky commented at 6:14 pm on August 29, 2018:

In commit “Introduce address index” (3c7cc3c705b08828dc8dc919e53343df78568ebd)

This sleep + BlockUntilSyncedToCurrentChain call is pretty hacky but probably ok in principle. I think @jimpo could easily tell you how to improve it. I would move this code out of the AddrIndex class and into a g_txindex->WaitForSync() or similar method, and probably replace the sleep with a condition variable wait.

Or maybe this code will be unnecessary with #14035 undo data?

in src/index/addrindex.cpp:66 in 3c7cc3c705 outdated

61+
62+    iter->Seek(key_prefix);
63+    while (iter->Valid()) {
64+        std::pair<std::pair<char, uint64_t>, CDiskTxPos> key;
65+        uint64_t value;
66+        if (!iter->GetKey(key) || !iter->GetValue(value) || key.first != key_prefix) break;

ryanofsky commented at 6:22 pm on August 29, 2018:

In commit “Introduce address index” (3c7cc3c705b08828dc8dc919e53343df78568ebd)

Should probably reorder this condition. It seems pointless to retrieve the value when the key is wrong, and maybe dangerous if the value is a different format and deserializing could throw an exception.

in src/rpc/rawtransaction.cpp:223 in 226eeea973 outdated

203@@ -203,6 +204,123 @@ static UniValue getrawtransaction(const JSONRPCRequest& request)
204     return result;
205 }
206 
207+static UniValue searchrawtransactions(const JSONRPCRequest& request) {

ryanofsky commented at 6:32 pm on August 29, 2018:

In commit “Add searchrawtransactions RPC” (226eeea9736127269058fbbf7816c7100f90974d)

Should add a basic python functional test for the new RPC method.

ryanofsky commented at 6:58 pm on August 29, 2018: member

Reviewed most of the code, but just skimmed tests. It looks to me like this PR could be merged basically in its current form, so I’m curious if you’re intending to make the improvements cited in the PR description above here or in a separate PRs.

I left a few minor suggestions about the code, which you should feel free to ignore. The only changes I would definitely like to see here are:

adding some python code in test/functional/ to call the new rpc method
adding a blurb in doc/release notes.md to describe the feature and maybe mention some use-cases

in src/init.cpp:1630 in 4865275573 outdated

1624@@ -1611,6 +1625,13 @@ bool AppInitMain()
1625         g_txindex->Start();
1626     }
1627 
1628+    if (gArgs.GetBoolArg("-addrindex", DEFAULT_ADDRINDEX)) {
1629+        if (!g_txindex)
1630+            InitWarning(_("-txindex must be enabled for -addrindex to index spends from addresses."));

luke-jr commented at 12:37 pm on August 30, 2018:

There is no such thing as a “spend from address”

Sjors commented at 5:35 pm on September 7, 2018:

Suggest dropping “to index spends from addresses”; if someone configures addrindex= I assume they know why. Also prevents a long debate :-)

in src/rpc/rawtransaction.cpp:216 in 4865275573 outdated

211+            "\nReturns raw transactions that contain the given address and the hash of the block(s) they were found in.\n"
212+            "\nRequires -addrindex to be enabled.\n"
213+            "\nArguments:\n"
214+            "1. \"address\"    (string, required) The address to search for\n"
215+            "2. \"verbose\"    (bool, optional, default = false) If set to false, only returns data for hex-encoded `txid`s. \n"
216+            "3. \"skip\"       (numeric, optional, default = 0) If set, the result skips this number of initial values. \n"

luke-jr commented at 12:40 pm on August 30, 2018:

skip and count probably make sense on an options object instead.

in src/rpc/rawtransaction.cpp:217 in 4865275573 outdated

212+            "\nRequires -addrindex to be enabled.\n"
213+            "\nArguments:\n"
214+            "1. \"address\"    (string, required) The address to search for\n"
215+            "2. \"verbose\"    (bool, optional, default = false) If set to false, only returns data for hex-encoded `txid`s. \n"
216+            "3. \"skip\"       (numeric, optional, default = 0) If set, the result skips this number of initial values. \n"
217+            "3. \"count\"      (numeric, optional, default = 100) If set, the result will only contain this amount of values. \n"

luke-jr commented at 12:40 pm on August 30, 2018:

(this would be number 4, but n/a once on an options object)

in src/index/addrindex.cpp:114 in 4865275573 outdated

109+    for (const auto& tx : block.vtx) {
110+        for (const auto tx_out : tx->vout){
111+            positions.emplace_back(GetAddrID(tx_out.scriptPubKey), pos);
112+        }
113+
114+        if (g_txindex && !tx->IsCoinBase()) {

luke-jr commented at 12:45 pm on August 30, 2018:

This basically shouldn’t be here. The address used to receive a coin has zero relationship with the transaction spending that coin.

If someone really cares about getting it anyway, there should be a separate input txid/index -> spend txid index.

jimpo commented at 10:45 pm on August 30, 2018:

Spends from an address should at least get indexed under a separate key prefix even if it’s done by the same indexer.

Using the undo files is a much better way to get this information than the txindex.

in src/index/addrindex.cpp:132 in 4865275573 outdated

127+                    // It's also possible we can't find the tx in txindex because it fell behind in
128+                    // the ValidationInterface queue. In this case we also let it finish before continuing.
129+                    g_txindex->BlockUntilSyncedToCurrentChain();
130+
131+                    // If we still can't find the tx then a re-org may have happened.
132+                    if (!g_txindex->FindTx(tx_in.prevout.hash, block_hash, tx)) return false;

luke-jr commented at 12:45 pm on August 30, 2018:

This looks like a very fragile assumption.

in src/index/addrindex.cpp:129 in 4865275573 outdated

124+                        MilliSleep(1000); //TODO: find a less arbitrary sleep time.
125+                    }
126+
127+                    // It's also possible we can't find the tx in txindex because it fell behind in
128+                    // the ValidationInterface queue. In this case we also let it finish before continuing.
129+                    g_txindex->BlockUntilSyncedToCurrentChain();

luke-jr commented at 12:45 pm on August 30, 2018:

With this, why is the loop waiting for IsInSyncWithMainChain needed?

marcinja commented at 4:16 pm on August 30, 2018:

BlockUntilSyncedToCurrentChain returns false immediately if m_synced is false.

The loop waiting for IsInSyncWithMainChain waits for txindex to finish in ThreadSync. Once txindex finishes in ThreadSync it will be updated by ValiditionInterface callbacks from that point on.

This line (since txindex must have m_synced = true now) will actually block now until txindex syncs with the ValidationInterface queue.

jimpo commented at 10:49 pm on August 30, 2018:

A better way to do it is to loop until BlockUntilSyncedToCurrentChain becomes true.

luke-jr changes_requested

in src/index/addrindex.cpp:25 in 3c7cc3c705 outdated

20+std::unique_ptr<AddrIndex> g_addrindex;
21+
22+/**
23+ * Access to the addrindex database (indexes/addrindex/)
24+ *
25+ * The database stores a block locator of the chain the database is synced to

jimpo commented at 6:31 pm on August 30, 2018:

commit: Introduce address index

Can you move this comment to BaseIndex::DB and delete from TxIndex::DB as well? No need to copy it to every new index file, especially since it’s handled at the base layer.

in src/index/addrindex.cpp:158 in 4865275573 outdated

153+        batch.Write(std::make_pair(std::make_pair(DB_ADDRINDEX, pos.first), pos.second), block_hash.GetUint64(0));
154+    }
155+    return WriteBatch(batch);
156+}
157+
158+void AddrIndex::BlockDisconnected(const std::shared_ptr<const CBlock> &block) {

marcinja commented at 7:14 pm on August 30, 2018:

This needs an if (!m_synced) return;

in src/index/addrindex.h:47 in 3c7cc3c705 outdated

42+
43+    // Destructor is declared because this class contains a unique_ptr to an incomplete type.
44+    virtual ~AddrIndex() override;
45+
46+    /// Lookup transaction(s) by scriptPubKey. Fills txs vector with (block_hash, tx) pairs.
47+    bool FindTxsByScript(const CScript& dest, std::vector<std::pair<uint256, CTransactionRef>> &txs);

jimpo commented at 10:47 pm on August 30, 2018:

There should be a way of differentiating between txs where the script is the output and ones that spend from the address. Perhaps two methods or two output vector parameters.

in src/index/addrindex.cpp:162 in 3c7cc3c705 outdated

157+
158+void AddrIndex::BlockDisconnected(const std::shared_ptr<const CBlock> &block) {
159+    const uint64_t block_hash_bits = block->GetHash().GetUint64(0);
160+    std::unordered_set<uint64_t> addr_ids_to_remove;
161+
162+    {

jimpo commented at 10:51 pm on August 30, 2018:

Use the undo files instead of CCoinsViewCache.

in src/index/base.cpp:283 in 3c7cc3c705 outdated

277@@ -276,3 +278,8 @@ void BaseIndex::Stop()
278         m_thread_sync.join();
279     }
280 }
281+
282+
283+bool BaseIndex::IsInSyncWithMainChain() const {

jimpo commented at 10:52 pm on August 30, 2018:

This shouldn’t be necessary.

in src/index/base.h:65 in 3c7cc3c705 outdated

61@@ -61,6 +62,8 @@ class BaseIndex : public CValidationInterface
62     void BlockConnected(const std::shared_ptr<const CBlock>& block, const CBlockIndex* pindex,
63                         const std::vector<CTransactionRef>& txn_conflicted) override;
64 
65+    void BlockDisconnected(const std::shared_ptr<const CBlock> &block) override;

jimpo commented at 10:52 pm on August 30, 2018:

Why? The implementation is empty.

in src/index/addrindex.h:50 in 3c7cc3c705 outdated

45+
46+    /// Lookup transaction(s) by scriptPubKey. Fills txs vector with (block_hash, tx) pairs.
47+    bool FindTxsByScript(const CScript& dest, std::vector<std::pair<uint256, CTransactionRef>> &txs);
48+
49+    // Returns part of key used to store information in db.
50+    static uint64_t GetAddrID(const CScript& script);

jimpo commented at 10:56 pm on August 30, 2018:

Why is this exposed? It seems internal.

in src/index/addrindex.cpp:56 in 3c7cc3c705 outdated

51+
52+BaseIndex::DB& AddrIndex::GetDB() const { return *m_db; }
53+
54+bool AddrIndex::DB::ReadAddrIndex(const uint64_t addr_id,
55+                                  std::vector<std::pair<std::pair<char, uint64_t>, CDiskTxPos>> &keys_found,
56+                                  const bool filter_by_value,

jimpo commented at 11:00 pm on August 30, 2018:

Seems that filter_by_value functionality isn’t used? Also, I’d rename to something like filter_by_block because it’s not clear what value means and is tightly coupled to the database layout chosen.

jimpo changes_requested

jimpo commented at 11:15 pm on August 30, 2018: contributor

Nice work! I’m glad someone’s working on this. Concept ACK.

The AddrIndex should return information about the outpoints and differentiate between outputs and spends, not just return the raw transactions. In fact, the AddrIndex could just return outpoints, then the client code could use the TxIndex could to fetch the tx bodies. It’d involve a separate lookup though.
I don’t think it’s necessary to delete keys from the database when a block is disconnected. There’s no harm in leaving it. The higher level methods to search the index can then filter for results that are on the main chain if that’s what the client wants. It’d have to do this anyway to avoid races with reorgs and such.
I’m worried about collisions on address IDs because they are only 64 bits. I can think of three options, 1) use a 32 byte cryptographic hash, 2) use a 20 byte cryptographic hash of the script plus some randomly generated, database-global salt, 3) use a 32- or 64-bit non-cryptographic hash (might as well use Murmur3 or SipHash, not truncated SHA256), then store the full script as the database value to double check against. Option 3 feels best to me.
What’s the purpose of having the first byte of the block hash as the value? It doesn’t seem robust nor particularly useful.

marcinja commented at 3:19 pm on August 31, 2018: contributor

Thanks for all the reviews.

To answer some of @jimpo’s questions: 2 & 4. I included part of the block hash so that in BlockDisconnected we ae sure to remove the entries in the index from this block only (that’s where filter_by_value is used). The reason I chose to remove entries from the database is to prevent reading into a block file using an old CDiskTxPos that may no longer be a valid position. Otherwise in FindTxsByScript you could run into errors. You’re right that this problem would be better handled by higher level methods.

I think that returning just the outpoint is a better idea than the current choice so I’ll switch to that and try to incorporate all the other feedback here.

in src/index/addrindex.cpp:68 in 4865275573 outdated

63+    while (iter->Valid()) {
64+        std::pair<std::pair<char, uint64_t>, CDiskTxPos> key;
65+        uint64_t value;
66+        if (!iter->GetKey(key) || !iter->GetValue(value) || key.first != key_prefix) break;
67+
68+        if  (!filter_by_value || (filter_by_value && value == value_wanted)) {

practicalswift commented at 7:30 am on September 5, 2018:

Redundant condition filter_by_value: what about !filter_by_value || value == value_wanted instead?

in src/index/addrindex.cpp:146 in 4865275573 outdated

141+    }
142+
143+    return m_db->WriteToIndex(positions, block.GetHash());
144+}
145+
146+bool AddrIndex::DB::WriteToIndex(const std::vector<std::pair<uint64_t, CDiskTxPos>>& positions, const uint256 block_hash)

practicalswift commented at 7:30 am on September 5, 2018:

block_hash should be passed by const reference?

in src/test/test_bitcoin.cpp:181 in 4865275573 outdated

174@@ -173,6 +175,76 @@ TestChain100Setup::CreateAndProcessBlock(const std::vector<CMutableTransaction>&
175     return result;
176 }
177 
178+// Based off of BuildChain in validation_block_tests.cpp
179+
180+// Build a chain of blocks that contains all of the transactions in txns.
181+void TestChain100Setup::BuildChain(const uint256 prev_hash, const uint32_t prev_time, int height, std::vector<CMutableTransaction> &txns, const CScript& scriptPubKey, std::vector<std::shared_ptr<const CBlock>>& blocks) {

practicalswift commented at 7:30 am on September 5, 2018:

prev_hash should be passed by const reference?

in src/test/test_bitcoin.h:92 in 4865275573 outdated

88@@ -89,6 +89,15 @@ struct TestChain100Setup : public TestingSetup {
89     CBlock CreateAndProcessBlock(const std::vector<CMutableTransaction>& txns,
90                                  const CScript& scriptPubKey);
91 
92+    void BuildChain(const uint256 prev_hash,

practicalswift commented at 7:31 am on September 5, 2018:

prev_hash should be passed by const reference?

in src/test/test_bitcoin.cpp:228 in 4865275573 outdated

223+    blocks.push_back(shared_pblock);
224+
225+    BuildChain(blocks.back()->GetHash(), blocks.back()->nTime, height - 1, txns, scriptPubKey, blocks);
226+}
227+
228+void TestChain100Setup::CreateSpendingTxs(int coinbase_spent_offset, std::vector<CScript>& script_pub_keys, std::vector<CMutableTransaction> &spends, CScript coinbase_script_pub_key) {

practicalswift commented at 7:32 am on September 5, 2018:

coinbase_script_pub_key should be passed by const reference?

in src/rpc/rawtransaction.cpp:312 in 4865275573 outdated

307+        }
308+        return ret;
309+    }
310+
311+    std::vector<std::pair<uint256, CTransactionRef>>::const_iterator it = result.begin();
312+    while (it != result.end() && skip--) it++; // Skip first set of results as needed.

practicalswift commented at 7:32 am on September 5, 2018:

Nit: ++it

in src/rpc/rawtransaction.cpp:324 in 4865275573 outdated

319+            std::string hex_tx = EncodeHexTx(*(tuple.second), RPCSerializationFlags());
320+            tx_val.pushKV("hex", hex_tx);
321+            tx_val.pushKV("blockhash", tuple.first.GetHex());
322+        }
323+        ret.push_back(tx_val);
324+        it++;

practicalswift commented at 7:32 am on September 5, 2018:

Nit: ++it

in src/test/test_bitcoin.cpp:194 in 4865275573 outdated

189+    block.vtx.resize(1);
190+
191+    // If this is the last block, add all remaining transactions.
192+    // Otherwise add with some randomness.
193+    for (auto it = txns.begin(); it != txns.end();) {
194+        bool add_tx = (height == 1) || (GetRandInt(height) < txns.size());

practicalswift commented at 8:16 pm on September 6, 2018:

0test/test_bitcoin.cpp:194:60: warning: comparison of integers of different signs: 'int' and 'std::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
1        bool add_tx = (height == 1) || (GetRandInt(height) < txns.size());
2                                        ~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~

Sjors commented at 5:37 pm on September 7, 2018:

@MarcoFalke why is it again that Travis doesn’t fail on warnings?

in src/index/addrindex.cpp:110 in 4865275573 outdated

105+    std::vector<std::pair<uint64_t, CDiskTxPos>> positions;
106+    positions.reserve(2 * block.vtx.size());  // Most transactions have at least 1 input and 1 output.
107+
108+    // Index addresses of spent outputs if txindex is enabled,
109+    for (const auto& tx : block.vtx) {
110+        for (const auto tx_out : tx->vout){

practicalswift commented at 8:17 pm on September 6, 2018:

0index/addrindex.cpp:110:25: warning: loop variable 'tx_out' of type 'const CTxOut' creates a copy from type 'const CTxOut' [-Wrange-loop-analysis]
1        for (const auto tx_out : tx->vout){
2                        ^
3index/addrindex.cpp:110:14: note: use reference type 'const CTxOut &' to prevent copying
4        for (const auto tx_out : tx->vout){
5             ^~~~~~~~~~~~~~~~~~~

in src/index/addrindex.cpp:115 in 4865275573 outdated

110+        for (const auto tx_out : tx->vout){
111+            positions.emplace_back(GetAddrID(tx_out.scriptPubKey), pos);
112+        }
113+
114+        if (g_txindex && !tx->IsCoinBase()) {
115+            for (const auto tx_in : tx->vin) {

practicalswift commented at 8:17 pm on September 6, 2018:

Same here - see above.

in src/index/addrindex.cpp:168 in 4865275573 outdated

163+        LOCK(cs_main);
164+        CCoinsViewCache view(pcoinsTip.get());
165+
166+        // Collect all addr_ids from txs in this block.
167+        for (const auto& tx : block->vtx) {
168+            for (const auto tx_out : tx->vout){

practicalswift commented at 8:18 pm on September 6, 2018:

Same here - see above.

in src/index/addrindex.cpp:173 in 4865275573 outdated

168+            for (const auto tx_out : tx->vout){
169+                addr_ids_to_remove.emplace(GetAddrID(tx_out.scriptPubKey));
170+            }
171+
172+            if (!tx->IsCoinBase()) {
173+                for (const auto tx_in : tx->vin){

practicalswift commented at 8:18 pm on September 6, 2018:

Same here - see above.

in src/test/addrindex_tests.cpp:33 in 4865275573 outdated

107+
108+    // Mine blocks for coinbase maturity, so we can spend some coinbase outputs in the test.
109+    CScript coinbase_script_pub_key = CScript() <<  ToByteVector(coinbaseKey.GetPubKey()) << OP_CHECKSIG;
110+    for (int i = 0; i < 20; i++) {
111+        std::vector<CMutableTransaction> no_txns;
112+        const CBlock& block = CreateAndProcessBlock(no_txns, coinbase_script_pub_key);

practicalswift commented at 8:18 pm on September 6, 2018:

0test/addrindex_tests.cpp:112:23: warning: unused variable 'block' [-Wunused-variable]
1        const CBlock& block = CreateAndProcessBlock(no_txns, coinbase_script_pub_key);
2                      ^

in src/rpc/rawtransaction.cpp:313 in 4865275573 outdated

301+
302+    UniValue ret(UniValue::VARR);
303+    std::vector<std::pair<uint256, CTransactionRef>> result;
304+    if (!g_addrindex->FindTxsByScript(scriptPubKey, result)) {
305+        if (!addrindex_ready) {
306+            throw JSONRPCError(RPC_INVALID_ADDRESS_OR_KEY,"Transactions with given address not found. Blockchain transactions are still in the process of being indexed");

Sjors commented at 5:47 pm on September 7, 2018:

The Blockchain transactions are still in the process of being indexed warning should always be shown, because there might be missing transactions while indexing is in progress (txindex doesn’t have that problem).

Sjors commented at 6:13 pm on September 7, 2018: member

Concept ACK. Suggest adding -addrindex and searchrawtransactions <address> to the PR description for easier review.

I would do “integrating output descriptors” in a different PR.

Lightly tested on macOS. It doesn’t find the coinbase transaction to mtnNvZY7iQCfzJCVon13tfJSVYDn9iiUWD in block 120, mg4bQva8w7Cjs8KaKHeEbyXUaCzoQAcXyH in block 10001 and several others.

in src/index/addrindex.h:52 in 4865275573 outdated

39+public:
40+    /// Constructs the index, which becomes available to be queried.
41+    explicit AddrIndex(size_t n_cache_size, bool f_memory = false, bool f_wipe = false);
42+
43+    // Destructor is declared because this class contains a unique_ptr to an incomplete type.
44+    virtual ~AddrIndex() override;

practicalswift commented at 7:59 am on September 23, 2018:

02018-09-22 21:20:09 cpplint(pr=14053): src/index/addrindex.h:44:  "virtual" is redundant since function is already declared as "override"  [readability/inheritance] [4]

DrahtBot added the label Needs rebase on Sep 24, 2018

karelbilek commented at 6:40 am on October 5, 2018: contributor

Comments to the concept

Some time ago I have rebased bitpay’s bitcore patches (from their insight block explorer) to bitcoin core here (without much work on my own)

#10370

The consensus back then seemed to be that more indexes in the core is not a good thing, and work should be instead done on external indexes. However I am no sure how this PR differs, I just quickly skimmed it.

MarcoFalke commented at 8:02 am on October 5, 2018: member

@karel-3d I think this one is based on #13033, so at least the performance impact should be less pronounced. Other than that it should be rather similar.

Sjors commented at 10:14 am on December 15, 2018: member

We talked about (address) indexes during the previous IRC meeting.

marcinja force-pushed on Jan 22, 2019

DrahtBot removed the label Needs rebase on Jan 22, 2019

marcinja force-pushed on Jan 23, 2019

in src/test/addrindex_tests.cpp:73 in fc075a0d94 outdated

68+    BOOST_CHECK(addr_index.BlockUntilSyncedToCurrentChain()); // Let the address index catch up.
69+    BOOST_CHECK(chainActive.Tip()->GetBlockHash() == block_hash); // Sanity check to make sure this block is actually being used.
70+
71+    // Now check that all the addresses we sent to are present in the index.
72+    for (int i = 0; i < 10; i++) {
73+        std::vector<std::pair<COutPoint, std::pair<CTransactionRef, uint256>>> spends;

practicalswift commented at 8:48 am on January 24, 2019:

spends shadows a variable existing in the outer scope.

in src/test/addrindex_tests.cpp:74 in fc075a0d94 outdated

69+    BOOST_CHECK(chainActive.Tip()->GetBlockHash() == block_hash); // Sanity check to make sure this block is actually being used.
70+
71+    // Now check that all the addresses we sent to are present in the index.
72+    for (int i = 0; i < 10; i++) {
73+        std::vector<std::pair<COutPoint, std::pair<CTransactionRef, uint256>>> spends;
74+        std::vector<std::pair<COutPoint, std::pair<CTransactionRef, uint256>>> creations;

practicalswift commented at 8:48 am on January 24, 2019:

Same here but for creations.

marcinja force-pushed on Jan 24, 2019

MarcoFalke deleted a comment on Jan 25, 2019

DrahtBot added the label Needs rebase on Jan 30, 2019

marcinja force-pushed on Jan 30, 2019

DrahtBot removed the label Needs rebase on Jan 30, 2019

marcinja force-pushed on Jan 30, 2019

marcinja commented at 10:19 pm on January 30, 2019: contributor

@Sjors

Lightly tested on macOS. It doesn’t find the coinbase transaction to mtnNvZY7iQCfzJCVon13tfJSVYDn9iiUWD in block 120, mg4bQva8w7Cjs8KaKHeEbyXUaCzoQAcXyH in block 10001 and several others.

I’m confused at what you’re referring to bymtn... and mg4.... The outputs for those two coinbase transactions are P2PK, so the only way to search for them is by scriptPubKey. I used the hex value of the scriptPubKey from here and was able to find it.

DrahtBot added the label Needs rebase on Feb 19, 2019

marcinja force-pushed on Mar 6, 2019

DrahtBot removed the label Needs rebase on Mar 6, 2019

DrahtBot added the label Needs rebase on Mar 18, 2019

marcinja force-pushed on Mar 20, 2019

DrahtBot removed the label Needs rebase on Mar 20, 2019

DrahtBot added the label Needs rebase on Mar 20, 2019

marcinja force-pushed on Mar 20, 2019

DrahtBot removed the label Needs rebase on Mar 21, 2019

DrahtBot added the label Needs rebase on Mar 21, 2019

ryanofsky commented at 9:36 pm on July 17, 2019: member

Maybe tag this needing concept ack.

This would probably be pretty easy to rebase, and has a previous concept ack from jimpo, but it’s not clear if other people want this change or might object to it.

MarcoFalke added the label Up for grabs on Jul 18, 2019

fanquake added the label Needs Conceptual Review on Aug 5, 2019

dan-da commented at 6:50 am on August 7, 2019: none

I for one would be very happy to see an optional addr index + api finally available in bitcoin-core, and not as a separate patch or hacky thing. @marcinja is there an API in your PR that accepts multiple addresses as input and returns total received/spent for each? Some block explorers have this functionality and it is great because much more efficient than calling API for each addr individually.

Also I think it should be caller’s option (via param) whether to return list of tx or not. Sometimes only addr summary info is needed eg total received/spent and tx info just adds bloat and slows things down in such cases.

marcinja force-pushed on Aug 16, 2019

DrahtBot removed the label Needs rebase on Aug 16, 2019

abunsen commented at 2:33 pm on October 1, 2019: none

@marcinja you’re so close. Thanks for all your hard work on this! Does anyone know if this is getting merged anytime soon?

fanquake added this to the "Chasing Concept ACK" column in a project

MarcoFalke removed the label Up for grabs on Oct 16, 2019

DrahtBot added the label Needs rebase on Oct 16, 2019

gmaxwell commented at 11:12 am on November 20, 2019: contributor

I haven’t looked at how this is implemented, at all– but the goal is one that is extraordinarily useful and addresses a severe and growing usability probable in the software that has burned me many times.

ryanofsky commented at 2:48 pm on November 21, 2019: member

I haven’t looked at how this is implemented, at all– but the goal is one that is extraordinarily useful and addresses a severe and growing usability probable in the software that has burned me many times.

Sorry for the dumb question, but what’s the severe usability problem?

ryanofsky commented at 8:38 pm on November 25, 2019: member

Most recent IRC discussion I could find about address indexes, http://www.erisian.com.au/bitcoin-core-dev/log-2018-12-13.html#l-601 seemed to have conclusion (from @jamesob) with: “I’ll report on any compelling usecases I find for addr index, but I suspect sipa et al are right that that’s usually just the Wrong way”

dan-da commented at 8:52 pm on November 25, 2019: none

“usually just the Wrong way”

This sounds like an ivory tower attitude to me. So tell us wise ones, if an addrindex is the wrong way, then what exactly is the right way? Because thus far bitcoin-core’s answer seems to be: don’t lookup transactions related to an address AT ALL unless they are in your own wallet, which is useless for anyone trying to provide address lookup as a service. They then must either run a patched bitcoind or turn to btcd or some other 3rd party solution.

ryanofsky commented at 8:58 pm on November 25, 2019: member

what exactly is the right way?

The right way to do what? Can you provide more detail? I think it’d be good to keep the discussion as concrete as possible and discuss specific problems and use cases.

dan-da commented at 9:19 pm on November 25, 2019: none

ok, here are two tools I alone have written that need something like searchrawtransactions().

hd-wallet-addrs - a tool for discovering all hd-wallet addresses that have ever been used. So it derives [receive,change] and checks if each address has ever received funds, up to gap limit. Presently it must rely on 3rd party block explorers APIs.

bitprices A tool for querying all transactions for all used addresses in a wallet (or any group of addresses) and generating reports including historical prices on the date of each transaction. Useful for audits and finding gain/loss, etc after the fact. Presently btcd has the best/fastest implementation, but only after I submitted patches for searchrawtransactions.

Here is a website frontend.

Note that the above tools work with any arbitrary seeds or addresses and are expected to return results quickly, so creating a watch-only wallet for each request is not a good option, and anyway it would then just be discarded immediately, so why bother?

The first two links above actually link to discussions of the various API providers and their various strengths/weaknesses for these purposes.

Others have their own reasons. afaict, most block explorers or bitcoin API providers need some sort of address index and are rolling their own somehow.

ryanofsky commented at 9:37 pm on November 25, 2019: member

Thanks. From what I can gather the tools described in #14053 (comment) are accounting / auditing tools useful for someone running the bitcoin wallet, and this PR would let people who want to use these tools run them locally instead of relying on centralized third party services.

It seems like a good argument in favor of this feature.

I’m not sure what the better arguments against this feature are, so it would be helpful if someone wanted to articulate them. One argument might be that we don’t want people getting used to using tools like these, because eventually in the future historical indexes will be less able to scale, and people who are using these tools in their workflow will be forced to turn to centralized third party services.

jnewbery commented at 10:07 pm on November 25, 2019: member

Concept ACK. The main argument against adding an address index to Bitcoin Core is that people/services will come to rely on them, and they’re not scalable long term. My response to that is that people are already using address indexes and block explorers. Providing/not providing that functionality in Bitcoin Core isn’t going to change that.

The new indexing infrastructure means that this can be done in a very unobtrusive way. The index has no direct interaction at all with validation or networking. Instead it hooks into the validation interface and runs in a background thread, and so should have no performance impact when the node is sync’ed (I expect this might slow down IBD as ActivateBestChain() gets blocked on LimitValidationInterfaceQueue(), but presumably users who are indexing the entire transaction history are prepared to wait a bit longer).

jgarzik commented at 10:08 pm on November 25, 2019: contributor

addr-index is the most frequently patched-in change to Core, across the years, IME.

dan-da commented at 10:21 pm on November 25, 2019: none

querying a 3rd party entails a loss of privacy and requires trusting the 3rd party to return truthful data.

jamesob commented at 4:45 pm on November 26, 2019: member

Concept ACK.

I know I’ve flip-flopped on this feature historically, but I agree very much with the points that @jnewbery makes. Trading the maintenance burden of ~800 lines of relatively non-consensus-risky indexing code for some very practical gains in user privacy is compelling to me. Even with the caveat that we don’t think some of these uses are architecturally sound for years and years.

One might make the argument that there’s a slippery slope here - if we consent to this index now, would demand for yet another index come along, and the next and so on? I don’t think so. In the few years I’ve followed this, there has been reliable demand for an address index (and numerous community implementations of varying repute) and nothing else - aside of course from the compact block index - so I don’t think a slippery-slope argument holds. @marcinja do you have time to rebase this PR?

marcinja commented at 11:18 pm on November 26, 2019: contributor

Yes, I will rebase this and clean it up a bit. Thanks for the conceptual review.

ryanofsky commented at 2:18 pm on November 27, 2019: member

Maybe we should log a warning when the address index is enabled, like:

Could also add a coda if the built in wallet is enabled:

The idea would be to give end users relying on third party accounting tools a way of knowing about the limitations of those tools, and encouraging authors to implement more scalable approaches.

promag commented at 11:51 pm on December 1, 2019: member

Concept ACK.

marcinja force-pushed on Dec 3, 2019

DrahtBot removed the label Needs rebase on Dec 3, 2019

marcinja force-pushed on Dec 3, 2019

marcinja force-pushed on Dec 4, 2019

marcinja force-pushed on Jan 2, 2020

jnewbery added the label Review club on Jan 3, 2020

marcinja force-pushed on Jan 3, 2020

ryanofsky commented at 4:31 pm on January 6, 2020: member

Nice set of notes on this PR at https://bitcoincore.reviews/14053.html

in src/Makefile.am:145 in c21767116b outdated

134@@ -135,6 +135,8 @@ BITCOIN_CORE_H = \
135   httpserver.h \
136   index/base.h \
137   index/blockfilterindex.h \
138+  index/addrindex.h \
139+  index/disktxpos.h \

jnewbery commented at 9:14 pm on January 7, 2020:

nit: this is added in the wrong commit (should be in Move only: Move CDiskTxPos to own file, not Add address index)

in src/index/addrindex.h:32 in c21767116b outdated

27+
28+private:
29+    const std::unique_ptr<DB> m_db;
30+
31+    // m_hash_seed is used by GetAddrID in its calls to MurmurHash3.
32+    // It is stored in the index, and restored from their on construction

jnewbery commented at 9:19 pm on January 7, 2020:

spelling: there

in src/index/addrindex.cpp:23 in c21767116b outdated

18+
19+#include <boost/thread.hpp>
20+
21+std::unique_ptr<AddrIndex> g_addr_index;
22+
23+static constexpr char DB_ADDR_INDEX = 'a';

jnewbery commented at 9:38 pm on January 7, 2020:

I’m not sure if this is actually needed. The other indexes have different prefixes for the different object types that they store, but all objects in this index are given the same prefix, so is it necessary?

in src/index/addrindex.cpp:126 in c21767116b outdated

121+        DBKey key;
122+        DBValue value;
123+        if (!iter->GetKey(key) || key.m_addr_id != addr_id || !iter->GetValue(value) ) break;
124+
125+        // Check that the stored script matches the one we're searching for, in case of hash collisions.
126+        if (value.second != script) continue;

jnewbery commented at 9:44 pm on January 7, 2020:

Does this branch need to advance the iterator? If not, I think value will be the same the next time round and we’ll never break out of this loop.

I think the following is what we want:

0    ...
1    if (value.second == script) {
2        result.emplace_back(std::make_pair(key, value));
3    }
4
5    iter->Next();
6}

in src/index/addrindex.cpp:178 in c21767116b outdated

173+{
174+    CBlockUndo block_undo;
175+    CDiskTxPos pos(pindex->GetBlockPos(), GetSizeOfCompactSize(block.vtx.size()));
176+    std::vector<std::pair<DBKey, DBValue>> entries;
177+
178+    const bool not_genesis_block = (pindex->nHeight > 0);

jnewbery commented at 9:48 pm on January 7, 2020:

style nit: having a variable defined as not_thing seems unintuitive to me. I think it’d be better to define genesis and then test on !genesis.

in src/index/addrindex.cpp:209 in c21767116b outdated

192+        // Skip coinbase inputs.
193+        if (not_genesis_block && i > 0) {
194+            const CTxUndo& tx_undo = block_undo.vtxundo[i-1];
195+            for (size_t k = 0; k < tx.vin.size(); ++k) {
196+                CScript spent_outputs_scriptpubkey = tx_undo.vprevout[k].out.scriptPubKey;
197+                DBKey key(DBKeyType::SPENT, GetAddrId(spent_outputs_scriptpubkey), tx.vin[k].prevout);

jnewbery commented at 9:57 pm on January 7, 2020:

I don’t think this isn’t the information that we want to save in the address index. For a given scriptPubKey, a user wants to know:

which txouts (txid, output index) spent to that scriptPubKey (the CREATED DBKeyType above)
which txins (txid, input index) consume UTXOs for that scriptPubKey (the SPENT DBKeyType here)

So here, I think you want to save the txid and input index spending the coin. You’re actually saving the txid and output index that creates the coin, because you’re using the prevout.

(I’m not entirely sure about this. Perhaps you are trying to return the outpoint that created the coin in the spent outputs array, but that’s not clear to me from the RPC documentation).

narula commented at 5:50 pm on January 8, 2020:

I think this is correct because he is using the prevout in the key. When looking up by this script with this outpoint, one would want to find this transaction because it spends this outpoint.

AFAICT the value just contains the position of the relevant transaction on disk and the scriptPubKey (to detect collisions). The value does not contain any indexes into inputs or outputs.

narula commented at 5:55 pm on January 8, 2020:

Can you add a comment with your key/value format somewhere and the justification?

in src/rpc/rawtransaction.cpp:227 in c21767116b outdated

222+            "\nRequires -addrindex to be enabled.\n"
223+            "\nArguments:\n",
224+                {
225+                    {"address", RPCArg::Type::STR_HEX, RPCArg::Optional::NO, /* default_val */ "", "address or scriptPubKey"},
226+                    {"verbose", RPCArg::Type::BOOL, RPCArg::Optional::OMITTED, /* default_val */ "false", "If false, return hex-encoded tx, otherwise return a json object"},
227+                    {"count", RPCArg::Type::NUM, RPCArg::Optional::OMITTED, /* default_val */ "", "The block in which to look for the transaction"},

jnewbery commented at 10:02 pm on January 7, 2020:

“The block in which to look for the transaction” is wrong

andrewtoth commented at 2:50 am on January 8, 2020:

The default value is implicitly 100. It should be explicitly defined here.

in src/rpc/rawtransaction.cpp:232 in c21767116b outdated

227+                    {"count", RPCArg::Type::NUM, RPCArg::Optional::OMITTED, /* default_val */ "", "The block in which to look for the transaction"},
228+                },
229+                {
230+                    RPCResult{"if verbose is not set or set to false",
231+            "\nResult:\n"
232+            " [                                    (array of json objects)\n"

jnewbery commented at 10:10 pm on January 7, 2020:

I suggest you make this an object with two keys: creates and spends

jnewbery commented at 10:20 pm on January 7, 2020: member

I’ve done a preliminary review and left a few comments.

I think this PR could use:

some design documentation - what’s being stored as keys and values in the addr index database and why, and what values are being returned to the RPC user. This could be a code comment at the top of addrindex.cpp or addrindex.h.
more test cases, particularly of the spent outputs values (there aren’t currently any tests for that) and persistence across stop-start. The functional test can be moved to its own file, rather than being added to rpc_rawtransaction.py.

in src/rpc/rawtransaction.cpp:218 in c21767116b outdated

213@@ -213,6 +214,142 @@ static UniValue getrawtransaction(const JSONRPCRequest& request)
214     return result;
215 }
216 
217+static UniValue searchrawtransactions(const JSONRPCRequest& request) {
218+    if (request.fHelp || request.params.size() < 1 || request.params.size() > 4)

andrewtoth commented at 2:45 am on January 8, 2020:

0    if (request.fHelp || request.params.size() < 1 || request.params.size() > 3)

MarcoFalke commented at 5:56 pm on January 10, 2020:

Should use the new .Check() syntax instead

in src/index/addrindex.cpp:220 in c21767116b outdated

215+
216+// FindTxsByScript fills the spends_result vector with outpoints corresponding
217+// to the output spent with the given script, and the transaction it was spent
218+// in. creations_result is filled with outpoints for outputs created with this
219+// script as their script pubkey, and the transactions they were created in.
220+bool AddrIndex::FindTxsByScript(const CScript& script,

andrewtoth commented at 3:09 am on January 8, 2020:

Shouldn’t this method take the count as a parameter and return early when enough entries are found?

in src/rpc/rawtransaction.cpp:339 in c21767116b outdated

321+            std::string hex_tx = EncodeHexTx(*(spend.second.first), RPCSerializationFlags());
322+            tx_val.pushKV("hex", hex_tx);
323+        }
324+        spends.push_back(tx_val);
325+        ++spends_it;
326+    }

andrewtoth commented at 3:13 am on January 8, 2020:

If there are more spends than count then no creates will be returned. Ideally the spends and creates will be returned in order, rather than all spends first.

in src/rpc/rawtransaction.cpp:223 in c21767116b outdated

213@@ -213,6 +214,142 @@ static UniValue getrawtransaction(const JSONRPCRequest& request)
214     return result;
215 }
216 
217+static UniValue searchrawtransactions(const JSONRPCRequest& request) {

andrewtoth commented at 3:14 am on January 8, 2020:

There doesn’t seem to be a way to skip transactions anymore. Was this removed intentionally?

in src/index/addrindex.cpp:123 in e739288bd5 outdated

118+    std::unique_ptr<CDBIterator> iter(NewIterator());
119+    iter->Seek(search_key);
120+    while (iter->Valid()) {
121+        DBKey key;
122+        DBValue value;
123+        if (!iter->GetKey(key) || key.m_addr_id != addr_id || !iter->GetValue(value) ) break;

narula commented at 4:31 pm on January 8, 2020:

Do you want to log or throw something here? If I understand the LevelDB docs correctly, you have what LevelDB says is a valid iterator which should be positioned at the key. I think if any of these return false you’d want to know as the DB could be corrupted.

narula commented at 4:40 pm on January 8, 2020:

I think in general it would be nice if this class did some of the error logging that I see in blockfilterindex.cpp.

in test/functional/rpc_rawtransaction.py:494 in c21767116b outdated

485@@ -487,6 +486,44 @@ def run_test(self):
486         assert_equal(testres['allowed'], True)
487         self.nodes[2].sendrawtransaction(hexstring=rawTxSigned['hex'], maxfeerate='0.20000000')
488 
489+        self._test_searchrawtransactions()
490+
491+    def _test_searchrawtransactions(self):

narula commented at 5:31 pm on January 8, 2020:

Please also test the spends_results path (returning transactions which spend from this address).

jonatack commented at 6:00 pm on January 8, 2020: member

Still gathering context here in light of the Concept ACKs others have given. Nevertheless, on first look I’m finding it difficult to ack the concept of investing resources on reviewing, adding, and maintaining a feature that will need to be removed down the road in favor of a dedicated external index by interested parties.

fjahr commented at 6:39 pm on January 8, 2020: member

For me what this PR is missing some analysis on the current size of the index in mainnet and the growth rate of the data. Has anyone run the numbers recently and can report?

in src/rpc/rawtransaction.cpp:314 in c21767116b outdated

309+    while (spends_it != spends_result.end() && count--) {
310+        const auto& spend = *spends_it;
311+        UniValue tx_val(UniValue::VOBJ);
312+
313+        UniValue outpoint_val(UniValue::VOBJ);
314+        outpoint_val.pushKV("txid", spend.first.hash.GetHex());

marcinja commented at 6:53 pm on January 8, 2020:

0        outpoint_val.pushKV("txid", spend.second.first.hash.GetHex());

I’ll write a functional test that catches this too :)

in src/index/addrindex.cpp:267 in c21767116b outdated

246+        std::pair<CTransactionRef, uint256> result =  std::make_pair(tx, header.GetHash());
247+
248+        // Place entry into correct vector depending on its type.
249+        switch (key.m_key_type) {
250+            case DBKeyType::SPENT:
251+                spends_result.emplace_back(std::make_pair(key.m_outpoint, result));

narula commented at 6:56 pm on January 8, 2020:

As pointed out by jnewbery in review club, I think this is wrong.

The DB’s key always represents the outpoint that created the scriptPubKey. For spends_result, you want the txid of the transaction that spends the scriptPubKey. So instead of the DB key’s outpoint as the transaction id to be returned to the user, you want the transaction id of the transaction in the DB value. Or, in the RPC code lines 314 and 315, use the txid from spend.second.

Edit: Or maybe this field of the returned result is always the created_outpoint, even though it’s called the spent_outpoint sometimes? Would be good to clarify this.

fjahr commented at 8:19 pm on January 8, 2020: member

I think @jnewbery was hinting at this in the PR Review Club as well: it seems like this index does not handle reorgs. BaseIndex handles this through Rewind which is not overridden.

andrewtoth commented at 2:49 am on January 10, 2020: contributor

Ran on i7-8750H, took about 48 hours on an already synced node. Took 200 GB disk space.

02020-01-08T02:22:01Z addr_index thread start
1...
22020-01-10T02:37:07Z addr_index is enabled at height 611001
32020-01-10T02:37:07Z addr_index thread exit

0$ du -h addr_index
1200G	addr_index

DrahtBot added the label Needs rebase on Jan 22, 2020

luke-jr commented at 3:06 am on January 29, 2020: member

Might be a good idea to replace CDiskTxPos with a height+tx# for [future?] pruning compatibility, like #13014

luke-jr commented at 3:07 am on January 29, 2020: member

I think @jnewbery was hinting at this in the PR Review Club as well: it seems like this index does not handle reorgs. BaseIndex handles this through Rewind which is not overridden.

Handling reorgs might actually be undesirable for this… but changing CDiskTxPos like I just suggested to a block height could be incompatible with saving references to blocks not in the main chain too. (Maybe storing height OR blockhash would work?)

marcinja force-pushed on Feb 12, 2020

DrahtBot removed the label Needs rebase on Feb 12, 2020

marcinja force-pushed on Feb 12, 2020

DrahtBot added the label Needs rebase on Apr 6, 2020

Talkless commented at 12:56 pm on April 8, 2020: none

@marcinja Do you believe this index could allow wallets such as Electrum to avoid relying on extra applications (like ElectrumX, Electrum Private Server, etc), and (if leared how to speak Bitcoin RPC) use Bitcoin Core node directly?

baso10 commented at 8:18 am on June 28, 2020: none

Any update when this could be merged?

fjahr commented at 9:49 am on June 28, 2020: member

Any update when this could be merged?

I kind of thought that this was still in conceptual review but it has received 5 Concept ACKs from experienced Core developers and I did not see much of the push back this has received historically, so I tend to believe this can be moved out of Conceptual review phase and move forward. There is still a lot of work to be done in terms of testing, documentation etc. @marcinja If you are currently unable to work on this I would be happy to take this over since I have some experience with indexes already :)

luke-jr commented at 10:00 am on June 28, 2020: member

Past NACKs don’t disappear just because a new PR is opened…

fjahr commented at 10:12 am on June 28, 2020: member

Past NACKs don’t disappear just because a new PR is opened…

Definitely agree, especially if the PRs are very close. But I think if a new PR is opened with a different approach it can still be expected for the reviewers who gave NACKs in the past, to participate in the discussion again and reinstate their NACK or someone else who agrees with their view should do it for them. Otherwise, we might never be able to move forward if a concept is NACKed and then the reviewer leaves the project or just does not comment on the new approach. But it would probably be good if @marcinja would address the past NACKs directly and give those reviewers another chance to say if they have changed their minds or not.

MarcoFalke commented at 11:22 am on June 28, 2020: member

Concept ACK

jonatack commented at 11:49 am on June 28, 2020: member

Any update when this could be merged? @baso10

This PR needs to be rebased, as you can see in #14053 (comment)
The PR is still going through peer review – see the discussion above
See https://github.com/bitcoin/bitcoin/blob/master/CONTRIBUTING.md#peer-review for more info about the peer review process in this project

jonatack commented at 11:49 am on June 28, 2020: member

A good resource for more context and history and reasons for and against this addition: https://bitcoincore.reviews/14053.

marcinja force-pushed on Jul 8, 2020

DrahtBot removed the label Needs rebase on Jul 8, 2020

marcinja force-pushed on Jul 8, 2020

DrahtBot added the label Needs rebase on Jul 14, 2020

marcinja force-pushed on Jul 14, 2020

DrahtBot removed the label Needs rebase on Jul 15, 2020

marcinja commented at 1:54 am on July 15, 2020: contributor

Hey everyone. Just wanted to give a small update here. I am still planning on keeping this PR up-to-date and rebased. I’ve realized it’s more annoying to rebase after procrastinating, and also it’s not fair to the reviewers and other people interested so I will be much more diligent about rebasing now.

The total size of the address index right now is 223GB. I don’t have a good answer on index sync time because I was re-syncing on a laptop. This was done on the 8f3fc6b4b8fb1a2db3035398d3dd5a2ed91cc3f9 branch over the course of a few days. I’m going to time an IBD and addrindex sync on a desktop with a good SSD so as to get meaningful results.

The good news is that searchrawtransactions is not prohibitively slow.

0time ./src/bitcoin-cli searchrawtransactions 3EiAcrzq1cELXScc98KeCswGWZaPGceT1d true 30000 > big-search
1
2./src/bitcoin-cli searchrawtransactions 3EiAcrzq1cELXScc98KeCswGWZaPGceT1d     4.79s user 1.95s system 49% cpu 13.575 total

(the file big-search created here is about 366MB)

I have also confirmed that the address index has the same number of created/spent outpoints as on this explorer: https://blockstream.info/address/3EiAcrzq1cELXScc98KeCswGWZaPGceT1d

DrahtBot added the label Needs rebase on Jul 30, 2020

marcinja force-pushed on Jul 30, 2020

DrahtBot removed the label Needs rebase on Jul 31, 2020

fanquake commented at 3:19 am on July 31, 2020: member

@marcinja If you feel like it, you could pull 2ce521784432550007c0df1da85b4d3d0b1c7477 and 2761d11cc798d42839676f98250781a3ea445ce2 out of here and PR separately.

in src/index/addrindex.cpp:41 in 75e044564b outdated

36+ * key_type is SPENT when the outpoint stored in the key is spent, i.e. it is
37+ * used as a prevout in a transaction input.  It is CREATED when the outpoint is
38+ * created as a new  COutpoint in that transaction.
39+ *
40+ * outpoints are stored in the key as opposed to in the DB value to preserve
41+ * uniqueness and to support multiple values for a a single addr_id and key_type

romanz commented at 5:16 pm on August 6, 2020:

0 * uniqueness and to support multiple values for a single addr_id and key_type

in src/index/addrindex.cpp:30 in 75e044564b outdated

25+ * stores one unique global value under the DBKeyType::SEED key that seeds the
26+ * MurmurHash3 hasher used to create AddrIds.
27+ *
28+ * The DB keys are structured as follows: <addr_id, key_type, outpoint>
29+ *
30+ * addr_id is the hash of the script_pub_key computed using MurmurHash3, a a

romanz commented at 5:16 pm on August 6, 2020:

0 * addr_id is the hash of the script_pub_key computed using MurmurHash3, a

DrahtBot added the label Needs rebase on Aug 13, 2020

marcinja force-pushed on Aug 15, 2020

DrahtBot removed the label Needs rebase on Aug 15, 2020

MarcoFalke referenced this in commit 1bc8e8eae2 on Aug 17, 2020

DrahtBot added the label Needs rebase on Aug 17, 2020

sidhujag referenced this in commit 4e2e81b871 on Aug 18, 2020

marcinja force-pushed on Aug 19, 2020

DrahtBot removed the label Needs rebase on Aug 19, 2020

laanwj referenced this in commit 27eeb0337b on Aug 20, 2020

sidhujag referenced this in commit defff93348 on Aug 20, 2020

fanquake referenced this in commit 0d9e14a646 on Aug 21, 2020

DrahtBot added the label Needs rebase on Aug 21, 2020

sidhujag referenced this in commit e954a012af on Aug 21, 2020

marcinja force-pushed on Aug 22, 2020

DrahtBot removed the label Needs rebase on Aug 22, 2020

DrahtBot added the label Needs rebase on Sep 22, 2020

marcinja force-pushed on Sep 30, 2020

DrahtBot removed the label Needs rebase on Sep 30, 2020

Add address index

Adds index that relates scriptPubKeys to location of transactions in the
filesystem, along with the hash of the block they are found in, and the
outpoint information of the txout with the related script.

4c4c0cf239

Initialize address index

Setup address index in initialization process.
Add initialization warning and wallet feature request warning as
suggested by ryanofsky.

545bc46a22

Add address index test 0f85eb9043

Add searchrawtransactions RPC

Adds searchrawtransactions RPC that uses the address index to lookup
transactions and outpoints by script and address. Adds basic functional
tests for searchrawtransactions.

cd274024b8

marcinja force-pushed on Oct 1, 2020

c78867886 commented at 7:01 am on October 13, 2020: none

Can someone plz approve this?

decryp2kanon commented at 1:48 am on November 8, 2020: contributor

Concept ACK

Talkless commented at 12:47 pm on December 13, 2020: none

@marcinja will this allow other wallets like Electrum to utilize this feature (RPC credentials provided, of course) and avoid having to run ElecturmX (https://github.com/spesmilo/electrumx), EPS (https://github.com/chris-belcher/electrum-personal-server) or BWT (https://github.com/shesek/bwt) intermediary software? @ecdsa @SomberNight @chris-belcher @shesek could you provide your input on how this feature might be useful (or not)?

romanz commented at 2:30 pm on December 13, 2020: contributor

The main issue AFAIU is that Electrum is using SHA256(scriptPubKey) while this PR is using MurmurHash3(scriptPubKey). Also, ElectrumX & electrs are using RocksDB for the index storage - resulting in better performance and disk usage (compared to LevelDB).

SomberNight commented at 3:04 pm on December 13, 2020: contributor

@Talkless note that while some Electrum users run their own bitcoind, many do not. Electrum wants to support both use cases, and in fact the suspicion is that most users just use a public server. When using a public server, the client cannot use bitcoind RPC, hence in that case I don’t see how the middleware (e.g. ElectrumX) could be avoided.

For the own bitcoind use case, maybe the client could have another optional mode of operation where it uses bitcoind RPC directly, which is I guess what you are asking about. For that, an address-index in bitcoind is the main thing missing indeed, however not the only one. For one, the electrum protocol (the client<->“middleware server” connection) has address subscriptions - the client gets a notification when a history of one its addresses changes. We are also planning on soon adding another method into the protocol that allows txoutpoint->spender_txid lookups (and notifications); I guess that could be implemented using the index in this PR albeit in a very inefficient way for heavily reused addresses.

IMHO there are multiple upsides for having this middleware setup for Electrum:

for the project, it keeps the codebase simpler (again, we want to support users without own bitcoind)
for the project, it also allows for more flexibility for implementing new functionality: just consider present PR here, we have needed such an index for 9 years but bitcoind did not have it or want it, so we could just implement it ourselves
for the server operator, even if they don’t want to open up the server for the public, they could share it with their friends and family. I don’t think that’s feasible with bitcoind RPC. I think this is a common use case.

Nevertheless if someone steps up and contributes patches, this kind of thing could be added.

The total size of the address index right now is 223GB

That sounds much larger than expected. Even with the txoutpoint->spender_txid map I mentioned above, when using LevelDB, ElectrumX uses around 90 GiB of disk space.

The DB keys are structured as follows: <addr_id, key_type, outpoint> The DB values are simply: <CDiskTxPos, CScript>

Instead of having both addr_id and CScript, why not just put a long hash, e.g. sha256(CScript) into the key?

Also, I expect most users of this index would also want txindex enabled. You might want to consider making address index dependent on txindex. Have you investigated how much space that would save? There would be no need to store CDiskTxPos.

Another trick that ElectrumX uses is that only the best chain is indexed. We have a tx_num->txid map as raw files on disk. A tx_num is the index of the transaction in the linear history of the blockchain, a 5-byte integer (so e.g. the genesis block coinbase tx has tx_num=0). This map uses around 17 GiB atm. With this, you can encode the txid part of the outpoint as 5 bytes instead of 32 bytes.

laanwj removed this from the "Chasing Concept ACK" column in a project

MarcoFalke commented at 7:09 pm on December 17, 2020: member

Concept ACK (might have already done that)

sipa commented at 7:25 pm on December 17, 2020: member

I’m concept -0 on this.

My primary objection is that I think it’s a bad idea for any infrastructure to be built all that relies on having fully-indexed blockchain data available (this also applies to txindex, but we can’t just remove support…).

However, it seems many people want something like this, and are going to use it anyway. The question is then whether it belongs in the bitcoin-core codebase. Alternative, and more performant presumably, like electrs exist already too, so it isn’t exactly impossible to do this elsewhere.

Still, given that we now have the indexes infrastructure, it means that things like this are easy to add in a fairly modular way without invading consensus code. So if people really want this, fine.

Overall approach comment: I don’t think MurmurHash should be used for anything new; there are strictly better hash functions available. I’d suggest SipHash if that’s fast enough.

jamesob commented at 7:42 pm on December 17, 2020: member

I’m also a little more negative on having this in Core than I previously was. After working in a few industrial contexts on wallet stuff, it’s clear to me that an address index is really only required if you want to implement a block explorer or do chain analysis. For both of these applications, using something like electrs seems sufficient.

For personal wallet management, a full address index is not required. I think the origin of some confusion is that things like the Electrum Personal Server have become synonymous with this kind of usage, but in reality a full index is overkill when descriptor-specific rescans can be done once for a historical backfill and then per-block scanning can be done from there on out.

I want to point out that this is a nice implementation and good work by @marcinja, but I’m leaning slightly against the inclusion of such an index in Core at this point.

jonasschnelli commented at 7:48 pm on December 17, 2020: contributor

My primary objection is that I think it’s a bad idea for any infrastructure to be built that relies on having fully-indexed blockchain data available.

I agree on this.

IMO the only use cases to ever use a full address index are:

1. Instant seed/xpriv backup recovery including spent history
1. Backend service for thousands of wallets
1. Debug/explore purposes

1 (instant backup recovery) could be solved with either scantxoutset (take a minute or two) or by an address-index for the utxo set only. But both would not restore the spent history. A scalable non-enterprise solution to restore the spent history is using blockfilters. Scan through the filters and rescan only the relevant blocks (a matter of minutes), see #20664.

2 (a backend for thousands of wallets): out of scope for this project.

3 (explore purposes): I think this is a valid use case. Though adding this PR to Bitcoin Core will lead to many many projects using it in production increasing the traffic in this project and eventually steal time from existing contributors (rebase, maintenance, drag-along)

My main fear is that people are going to use this index (a full address index) to use it as an electrum(ish) backend for a handful of wallets.

With multiwallet, watch-only-wallets, PSBT, we have all tools to server multiple wallets in a scalable way for external applications.

I also think merging this as it is, would be in contradiction to the process- and repository-separation effort.

Therefore I’m ~0 (slightly towards NACK) to add this. If this would be in another repository (still under bitcoin/*) and process separated, I would ACK it.

marcinja commented at 4:21 pm on January 4, 2021: contributor

Hi all, thanks for the feedback and review. This was an enjoyable PR to work on and I learned a lot from all your comments.

I’m closing this PR because its size probably requires stronger support from contributors to get in. It also seems more clear now that all of the practical use-cases are covered by existing features and some lightweight alternatives (https://github.com/bitcoin/bitcoin/pull/20664) .

I also agree that it would be bad to incentivize using an address index to support an external electrum wallet, when it’s not the intended use-case and would cause unnecessary burden on contributors and maintainers in this project, e.g. from users of those wallets wanting new features or updates.

marcinja closed this on Jan 4, 2021

jonatack commented at 4:53 pm on January 4, 2021: member

@marcinja thank you and I hope to see more contributions of this quality from you.

vijaydasmp referenced this in commit 909c5fa563 on Sep 15, 2021

vijaydasmp referenced this in commit f3165ddc45 on Sep 16, 2021

PastaPastaPasta referenced this in commit 0e3b75d391 on Sep 17, 2021

PastaPastaPasta referenced this in commit 4526efc6ac on Sep 17, 2021

vijaydasmp referenced this in commit 736421f02e on Sep 20, 2021

vijaydasmp referenced this in commit a97683ff3a on Sep 20, 2021

vijaydasmp referenced this in commit feddd3c0c2 on Sep 20, 2021

vijaydasmp referenced this in commit 71df8f25db on Sep 20, 2021

vijaydasmp referenced this in commit a489420e16 on Sep 22, 2021

vijaydasmp referenced this in commit 829aeff0f6 on Sep 22, 2021

kittywhiskers referenced this in commit c7dff0d165 on Oct 12, 2021

kittywhiskers referenced this in commit c1d2c79f4d on Oct 12, 2021

gades referenced this in commit bef7b2508d on May 2, 2022

gades referenced this in commit cf54acf2ce on May 2, 2022

DrahtBot locked this on Aug 16, 2022

Add address-based index (attempt 4?) #14053

Conflicts