[WIP] Index for UTXO Set Statistics #18000

pull fjahr wants to merge 13 commits into bitcoin:master from fjahr:utxo-stats-index-rebase changing 24 files +1347 −67
  1. fjahr commented at 11:07 pm on January 24, 2020: member

    This PR will not be updated anymore because the project is now split up into multiple pull requests. But I will use it to keep track of the projects PRs:

    Content PR Status Review Club
    Add Muhash and SHA256Writer (unused) #19055 Merged notes
    Add Muhash Python implementation (unused) #19105 Merged notes
    Add hash_type NONE (Muhash code independent) #19328 Merged -
    Add hash_type MUHASH #19145 Merged ?
    Add ASM optimizations for MuHash #19181 Open ?
    Add CoinStatsIndex #19521 Merged ?
    Parallelize gettxoutsetinfo
    Remove legacy UTXO set hash (future)

    Since it’s starting to get confusing I am trying to visualize the dependencies of the open PRs in this depency graph:

     0+--------------+   +-------------+   +----------+
     1|#19328        |   |#19105       |   |#19055    |
     2|hash_type NONE|   |Muhash Python|   |Muhash CPP|
     3+----+-----+---+   +------+------+   +-+----+---+
     4     ^     ^              ^            ^    ^
     5     |     |              |            |    |
     6     |     |    +---------+------+     |    |
     7     |     +----+#19145          +-----+    |
     8     |          |hash_type MUHASH|          |
     9     |          +----------------+          |
    10     |                               +------+---+
    11+----+--------+                      |#19181    |
    12|#19521       |                      |MuHash ASM|
    13|No hash Index|                      +----------+
    14+-------------+
    

    This implements an index of coin statistics with the goal of making the response time of the gettxoutsetinfo RPC call dramatically faster. Currently, this RPC is scanning the full UTXO set every time it is called which makes it hard to use for users that want to continually check the coin supply or compare UTXO set hashes between different nodes. It is especially challenging in periods of multiple quickly mined blocks, even relatively fast machines.

    Implementation overview:

    • The current serialization of the UTXO set for the purpose of hashing is changed based on sipa’s concept in proposed on the mailing list in 2017 and #10434
    • The hashing algorithm in for the UTXO set is changed to Muhash, which was also implemented by sipa
    • CoinStatsIndex is added which keeps an index of all values that would require gettxoutsetinfo to scan the UTXO set
    • CoinStatsIndex can be activated through the flag -coinstatsindex

    Todos/Open questions:

    • The transactions count is currently not implemented as it seems not knowable ex-post without running a full IBD to build up the index. I am looking for a solution to this. Ideas are:
      • Require txindex and the transactions count from it
      • Remove transactions count or mark it as unreliable in a different way if coinstatsindex is enabled
    • The transactions count question is also interesting because it seems to be the only obstacle to providing access to historical coin statistics, which may be a nice follow-up feature if transaction counts can be ignored somehow
    • More benchmarking to evaluate potential switch from Muhash to ECMH as the hashing algorithm
    • IBD benchmarking with the index enabled
    • Tests could be improved

    This is an extension to my rolling UTXO set hash proposal from last year.

    Edit: 3c5c1ca should probably be squashed into the prior commits but I wanted to leave sipa’s commits unchanged for the start.

  2. fanquake added the label UTXO Db and Indexes on Jan 24, 2020
  3. DrahtBot commented at 0:37 am on January 25, 2020: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #18795 (Test: wallet issue with orphaned rewards by domob1812)
    • #18354 (Protect wallet by using shared pointers by bvbfan)
    • #17786 (refactor: Nuke policy/fees->mempool circular dependencies by hebasto)
    • #13462 (Make SER_GETHASH implicit for CHashWriter and SerializeHash by Empact)
    • #10443 (Add fee_est tool for debugging fee estimation code by ryanofsky)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  4. fjahr force-pushed on Jan 25, 2020
  5. fjahr force-pushed on Jan 26, 2020
  6. fjahr force-pushed on Jan 26, 2020
  7. fjahr force-pushed on Jan 26, 2020
  8. fjahr force-pushed on Jan 26, 2020
  9. fjahr force-pushed on Jan 27, 2020
  10. emilengler commented at 5:27 pm on January 27, 2020: contributor

    I’m reviewing and testing it right now but from some benchmarks I did, I saw how the time goes down by 50%. However the performance increases by 19% percent.

    Regtest results only right now

    On master (ef8e2cee9f5d157eeb3139b64e9c3a5fa4bf36f3):

    0$ time src/bitcoin-cli --regtest gettxoutsetinfo
    1...
    2> src/bitcoin-cli --regtest gettxoutsetinfo  0,00s user 0,00s system 23% cpu 0,014 total
    

    With this PR

    0$ time src/bitcoin-cli --regtest gettxoutsetinfo
    1...
    2> src/bitcoin-cli --regtest gettxoutsetinfo  0,00s user 0,00s system 42% cpu 0,007 total
    
  11. ryanofsky commented at 8:03 pm on January 27, 2020: member
    Can you say more about current use cases of the gettxoutsetinfo function and maybe future use cases if it’s significantly faster with this change?
  12. sipa commented at 9:18 pm on January 27, 2020: member
    I think it’s perfectly fine to drop the “transactions” statistics. It doesn’t really mean anything; it just happened to be easy to compute pre-pertxout, but even now it’s mostly a hack.
  13. fjahr commented at 11:40 pm on January 27, 2020: member

    @emilengler Thanks for giving it a try! But I think the benchmarks you are comparing are not well suited in this case. I assume you are running a standard regtest network, maybe generated a few hundred blocks and made some transactions by hand. In this case the UTXO set is very small and iterating over it will be extremely fast, so that it would not surprise me if there is no measurable upside to using the index (maybe it will be even a little slower as you saw). But the goal of this index is that the time of the call will stay constant for a realistically sized UTXO set. So a better indicator would be to try it on a synced mainnet or testnet node, where gettxoutset currently takes several minutes without the index, even on pretty high powered hardware. @ryanofsky Sure, I will go along the different statistics the call provides to highlight which ones I think are more or less important:

    • total_amount The total amount of the current bitcoin supply. Tracking this number on a continual basis allows users to detect potential inflation bugs, such as CVE-2018-17144 (Example: Bitmex Research). I think all users should be given the chance to continually check this, even on more affordable hardware. I am currently running a mainnet node on an Odroid HC2 and gettxoutsetinfo takes 7min 50s. Specs are roughly comparable to a Raspberry Pi 4, so it can keep up well with the chain otherwise. If I wanted to run a continual inflation detection on this hardware, I would not be able to do so when the UTXO set has grown another 20-25% (ignoring hashrate fluctuations).
    • hash_serialized_2 The serialized hash of the UTXO set can be used for a quick comparison of the UTXO set between different notes. Essentially, it can be a stand-in for the latest blockhash and compared for sanity checks. BTCPayServer has been using it for their FastSync feature, which will probably be replaced with assume_utxo, where a UTXO set hash (not necessarily this one) will play a role as well. I still need to sync up with @jamesob about the latest plans for the hash and how this proposal can be compatible with his or even help/support it.
    • txouts The total number of unspent outputs is an interesting number to track from an operations standpoint. Users can see if coins are being consolidated or at which rate the UTXO set is growing. Example: satoshi.info
    • bogosize and disk_size might be also interesting to track but not as actionable, disc usage is probably much more efficient to collect from the OS
    • transactions it seems to be a nice-to-have metric that does not serve any real purpose afaict, as @sipa also noted
    • height and bestblock can be ignored since they are available from other RPCs, but are necessary to compare numbers between different machines

    For future use cases (aside from integration with assume_utxo potentially) I don’t have specific ideas yet, but it opens the door to track any other statistics of the UTXO set or block content in the index. So it may collect further metrics of interest where users currently parse the block data with their own custom software.

  14. emilengler commented at 7:39 pm on January 28, 2020: contributor

    @fjahr I’m a bit confused now, I ran the tests again on a full chain and master is MUCH faster than this Pull Request (4 MINUTES faster) On master (2755b2b1092d0286022cf3cc3028e96f6bee2b34)

     0$ time src/bitcoin-cli --datadir=/media/emil/Files/Bitcoin/ gettxoutsetinfo
     1{
     2  "height": 614943,
     3  "bestblock": "0000000000000000000fa03a2ce4c6d7fb763e55402deac32371b898e8faba5e",
     4  "transactions": 38747934,
     5  "txouts": 65245740,
     6  "bogosize": 4903711195,
     7  "hash_serialized_2": "011d3eccd148b68a3bf039f0695bb4d77a23ba208dcd4d8ac073d867fd5bbef6",
     8  "disk_size": 4035082036,
     9  "total_amount": 18186617.32125282
    10}
    11src/bitcoin-cli --datadir=/media/emil/Files/Bitcoin/ gettxoutsetinfo  0,00s user 0,01s system 0% cpu 57,428 total
    

    On fjahr:utxo-stats-index-rebase (663dbfbd1da60e91c9576629ace82b712d31668c)

     0$ time src/bitcoin-cli --datadir=/media/emil/Files/Bitcoin/ gettxoutsetinfo
     1{
     2  "height": 614937,
     3  "bestblock": "00000000000000000005973a14ac338b681e410922c7566e435107db1b1d807b",
     4  "transactions": 38745882,
     5  "txouts": 65244661,
     6  "bogosize": 4903634025,
     7  "utxo_set_hash": "87d3e7f807d6b428bf60e983a73137ef410cb510de1b5f4e650e081d044fc772",
     8  "disk_size": 4082447626,
     9  "total_amount": 18186542.32125282
    10}
    11src/bitcoin-cli --datadir=/media/emil/Files/Bitcoin/ gettxoutsetinfo  0,01s user 0,00s system 0% cpu 5:01,42 total
    
  15. ryanofsky commented at 7:22 pm on January 29, 2020: member

    @ryanofsky Sure, I will go along the different statistics the call provides to highlight which ones I think are more or less important:

    Thanks, this information is really clarifying. I wonder if some of it could be incorporated in the reference documentation (without mentioning specific CVE’s or software projects), or maybe if it would be useful as part of some wiki. In any case, this definitely helps motivate the PR, so concept ACK from me

  16. fjahr commented at 5:53 pm on January 30, 2020: member
    @emilengler did you actually start bitcoind with -coinstatsindex? I mention it briefly in the PR description, but as @ryanofsky already pointed out as well, I need to do a much better job at documenting the intent and current state of the proposal. It seems so because otherwise you could not test it that quickly, building up the index takes a quite long time and while the index is building up gettxoutsetinfo is responding with “unable to read UTXO set”. In fact, on my machine, it is currently so slow that I might have to address it. But running a plain IBD is also incredibly slow right now so it could also be a local issue. What is probably happening is that you are running bitcoind without the flag and so the stats are still calculated by iterating over the UTXO set but the hash is using the new hashing algorithm (Muhash), which is slower and results in slower performance when not using the index.
  17. emilengler commented at 6:55 pm on January 30, 2020: contributor

    @emilengler did you actually start bitcoind with -coinstatsindex?

    no, sorry oversaw that

  18. emilengler commented at 8:22 pm on January 30, 2020: contributor

    Hmmm I get this error message now:

    0error code: -32603
    1error message:
    2Unable to read UTXO set
    

    the chain state is about ~2 days old

  19. fjahr commented at 8:30 pm on January 30, 2020: member

    Hmmm I get this error message now:

    0error code: -32603
    1error message:
    2Unable to read UTXO set
    

    the chain state is about ~2 days old

    Yeah, that means the index is syncing at the moment. You should see the progress in the logs of bitcoind. Something like Syncing coinstatsindex with block chain from height XXXXX.

  20. in src/node/coinstats.cpp:58 in b5f112cb0a outdated
    53+        }
    54+
    55+        if (g_coin_stats_index->LookupStats(block_index, stats)) {
    56+            stats.hashBlock = pcursor->GetBestBlock();
    57+            {
    58+                LOCK(cs_main);
    


    ariard commented at 1:59 am on February 25, 2020:

    b5f112c

    Can’t you use block_index->nHeight here and avoid a lock ?


    fjahr commented at 11:09 pm on March 6, 2020:
    Done. Cleaned up that whole part considerably.
  21. in src/index/coinstatsindex.cpp:134 in 93db4a5018 outdated
    135+bool CoinStatsIndex::WriteBlock(const CBlock& block, const CBlockIndex* pindex)
    136+{
    137+    CBlockUndo block_undo;
    138+
    139+    // Ignore genesis block
    140+    if (pindex->nHeight > 0) {
    


    ariard commented at 2:17 am on February 25, 2020:

    93db4a5

    There is a coinbase output in the genesis block?


    fjahr commented at 11:10 pm on March 6, 2020:

    ariard commented at 3:14 am on March 22, 2020:
    Ah my point was you may need to add the coinbase output is in the utxo set but apparently it can’t be spent https://bitcoin.stackexchange.com/questions/10009/why-can-t-the-genesis-block-coinbase-be-spent. Good to know
  22. in src/index/coinstatsindex.cpp:165 in 93db4a5018 outdated
    160+            const auto& tx = block.vtx.at(i);
    161+            // TODO: deal with m_nTransactions
    162+
    163+            for (size_t j = 0; j < tx->vout.size(); ++j) {
    164+                // ignore segwit block OP_RETURN output with merkle root of witness tree
    165+                if (tx->IsCoinBase() && j > 0) break;
    


    ariard commented at 2:19 am on February 25, 2020:

    93db4a5

    Also, OP_RETURN outputs shouldn’t be considered (we remove them from utxo set IIRC) ?


    fjahr commented at 11:09 pm on March 6, 2020:
    You are right, fixed now!
  23. ariard commented at 2:25 am on February 25, 2020: member
    • benchmark: have you run gettxoutsetinfo with-and-without this PR on few UTXO sets across the years to check than performance holds linearly with utxo set growth (or a least give an intuition of it) ?

    • assume-utxo: As security model of assume-utxo lays on public audit of a hardcoded hash of a utxo set, lowering the bar to compute it will increase review process trustworthiness with a higher number of participants. So IMO this point is worthy enough to Concept ACK this proposal (though @jamesob may already have work here?)

    Quickly skimmed over the new code, it’s clean enough if we want later to memory-isolate indexes in their own space. Don’t have opinion on either MuHash or ECMH (but no lattice).

  24. fjahr force-pushed on Mar 6, 2020
  25. fjahr marked this as ready for review on Mar 6, 2020
  26. fjahr renamed this:
    [WIP] Coin Statistics Index
    Index for Coin Statistics
    on Mar 6, 2020
  27. fjahr commented at 11:16 pm on March 6, 2020: member

    First of all, thanks for taking early looks and comments @ariard , @ryanofsky, @emilengler , @jnewbery (offline). Your comments should be mostly addressed in the latest code changes and in further information below. With the latest fixes and performance improvements, this should be now ready for real review and testing.

    Short recap of changes (this changed slightly)

    • Introduces MuHash3072 and TruncatedSHA256Writer implementations (code by Pieter Wuille)
    • Uses above for MuHash hashing alogrithm for the UTXO set hash in gettxoutsetinfo
    • Removed tracking of the transactions count statistic
    • Introduces CoinStatsIndex class, an index for all the statistics in gettxoutsetinfo
    • Allows activating the index using the flag -coinstatsindex. When activated gettxoutsetinfo responds within seconds after the initial sync phase. Otherwise it works as before but will be slower to parse the UTXO set due to the change in the hashing algorithm.

    Benchmarking

    • For reference, on my benchmarking server and my local machine, IBD takes around 18-19 hours
    • I can provide more detailed numbers if required but so far I am pretty confident that it’s not a bottleneck even for users on under-powered nodes.

    How long does the sync of the CoinStatsIndex take?

    • On my benchmarking server syncing the Coinstatsindex after IBD took between 8.5 and 9.5h hours
    • On my local machine, I also synced testnet and it took 8h 20min.

    How about the growth of the UTXO set (ariard’s question above)?

    • For the sync time of the index should actually not make much of a difference if the UTXO set shrinks or grows, it should rather grow linearly with the size of the blockchain.
    • Instead, growth of the UTXO set causes gettxoutsetinfo without the index to run longer because it iterates over it every time and so it is one of the reasons why I think the index is worth integrating into core.

    TBD

    • Although its a lot of LOC I think the changes are still ok to be reviewed as one PR but if people would prefer me to split it into two or three smaller PRs that is easily doable.
    • The newly introduced metric coins count that is only used in Assume UTXO is not in the index and I don’t think there is a practical way to integrate it (similar problem to transaction count). That means it will probably not be possible to have this index running and create a new assume utxo set on the same node at the same time. I will need to discuss this with @jamesob to make sure this dealt with correctly one Assume UTXO is finished.
  28. fjahr force-pushed on Mar 7, 2020
  29. fjahr commented at 0:08 am on March 7, 2020: member
    Rebased
  30. fjahr renamed this:
    Index for Coin Statistics
    Index for UTXO Set Statistics
    on Mar 7, 2020
  31. fjahr force-pushed on Mar 15, 2020
  32. fjahr force-pushed on Mar 16, 2020
  33. fjahr force-pushed on Mar 16, 2020
  34. fjahr force-pushed on Mar 16, 2020
  35. fjahr force-pushed on Mar 17, 2020
  36. fjahr force-pushed on Mar 19, 2020
  37. fjahr force-pushed on Mar 20, 2020
  38. fjahr force-pushed on Mar 20, 2020
  39. fjahr force-pushed on Mar 20, 2020
  40. fjahr force-pushed on Mar 20, 2020
  41. fjahr force-pushed on Mar 21, 2020
  42. fjahr force-pushed on Mar 21, 2020
  43. fjahr commented at 6:33 pm on March 21, 2020: member
    Latest changes fixed/clarified lots of comments/documentation and finally made Travis happy (when it runs again).
  44. laanwj added this to the "Chasing Concept ACK" column in a project

  45. DrahtBot added the label Needs rebase on Mar 31, 2020
  46. fjahr force-pushed on Apr 2, 2020
  47. fjahr commented at 11:57 am on April 2, 2020: member
    Rebased.
  48. DrahtBot removed the label Needs rebase on Apr 2, 2020
  49. DrahtBot added the label Needs rebase on Apr 15, 2020
  50. fjahr force-pushed on Apr 15, 2020
  51. fjahr commented at 8:20 pm on April 15, 2020: member
    Removed the last commit which was just a small test improvement that became redundant after a recently merged change.
  52. DrahtBot removed the label Needs rebase on Apr 15, 2020
  53. DrahtBot added the label Needs rebase on Apr 17, 2020
  54. pierreN commented at 1:38 am on April 18, 2020: contributor

    Nice PR. I just tested this on my laptop (slow SSD, i7-6700HQ). No issue to report.

    On master, gettxoutsetinfo takes 55s to complete. On this branch less than a second. The coinstatsindex thread took around 7h30mn to complete.

  55. fjahr force-pushed on Apr 19, 2020
  56. fjahr commented at 3:29 pm on April 19, 2020: member

    Another rebase, no code changes.

    Thanks for testing @pierreN !

  57. DrahtBot removed the label Needs rebase on Apr 19, 2020
  58. in src/node/coinstats.cpp:24 in d22951b3f8 outdated
    21 {
    22     assert(!outputs.empty());
    23-    ss << hash;
    24-    ss << VARINT(outputs.begin()->second.nHeight * 2 + outputs.begin()->second.fCoinBase ? 1u : 0u);
    25     for (const auto& output : outputs) {
    26-        ss << VARINT(output.first + 1);
    


    Sjors commented at 2:24 pm on April 26, 2020:
    In d22951b: what was the original rationale for +1?

    fjahr commented at 2:42 pm on May 15, 2020:
    I am not sure, it seems to have been introduced here https://github.com/bitcoin/bitcoin/pull/2602/commits/e31aa7c9d7dd204b5658f20c19565eee308e35c1 but there was no rationale given.
  59. in src/node/coinstats.cpp:27 in d22951b3f8 outdated
    29+        COutPoint outpoint = COutPoint(hash, output.first);
    30+        Coin coin = output.second;
    31+
    32+        TruncatedSHA512Writer ss(SER_DISK, 0);
    33+        ss << outpoint;
    34+        ss << (uint32_t)(coin.nHeight * 2 + coin.fCoinBase);
    


    Sjors commented at 2:31 pm on April 26, 2020:
    In d22951b3f8d79b5ec2069961c084c8d27b6b13fa: this may be a good time to explain what coin.nHeight * 2 + coin.fCoinBase is good for. I guess it reserves 1 bit to indicate if this is a coinbase, but why do we care about that? Maybe BIP30 related?

    fjahr commented at 2:42 pm on May 15, 2020:
    Hm, it’s a good question. I have not questioned it beyond the fact that it’s a neat way to document height and coinbase in one uint32. fCoinbase doesn’t make a difference for BIP30 because those were all coinbases and the duplicated txs are filtered earlier already. So I don’t see a reason why we would need the information but if we can have it without taking up any extra space, why not? I think based on the commit message here I would guess the reasoning is that this information should be added to hash and it was added in one to minimize change: https://github.com/bitcoin/bitcoin/commit/d342424301013ec47dc146a4beb49d5c9319d80a
  60. in src/index/coinstatsindex.h:25 in 8e17009dcc outdated
    20+private:
    21+    std::string m_name;
    22+    std::unique_ptr<BaseIndex::DB> m_db;
    23+
    24+    MuHash3072 m_muhash;
    25+    uint64_t m_nTransactionOutputs;
    


    Sjors commented at 2:40 pm on April 26, 2020:
    nit: snake case, e.g. m_transaction_output_count https://github.com/bitcoin/bitcoin/blob/master/doc/developer-notes.md

    fjahr commented at 2:41 pm on May 15, 2020:
    done
  61. in src/index/coinstatsindex.cpp:59 in 8e17009dcc outdated
    54+    template<typename Stream>
    55+    void Unserialize(Stream& s)
    56+    {
    57+        char prefix = ser_readdata8(s);
    58+        if (prefix != DB_BLOCK_HEIGHT) {
    59+            throw std::ios_base::failure("Invalid format for block filter index DB height key");
    


    Sjors commented at 2:43 pm on April 26, 2020:
    “coin stats index” (same below)

    fjahr commented at 2:40 pm on May 15, 2020:
    done
  62. in src/index/coinstatsindex.cpp:159 in 8e17009dcc outdated
    154+        // Add the new utxos created from the block
    155+        for (size_t i = 0; i < block.vtx.size(); ++i) {
    156+            const auto& tx = block.vtx.at(i);
    157+
    158+            // Skip duplicate txid coinbase transactions (BIP30).
    159+            if (is_bip30_block && tx->IsCoinBase()) {
    


    Sjors commented at 2:48 pm on April 26, 2020:
    Why do we need to skip these? Block height is included in the muhash anyway, so it wouldn’t be a duplicate entry.

    fjahr commented at 2:40 pm on May 15, 2020:
    If we wouldn’t skip these we would get a different hash from the index than we get from the non-index nodes because these outputs are also not in the UTXO set which the non-index nodes use to calculate the hash. My goal is that the hash is the same with or without the index.
  63. in src/index/coinstatsindex.cpp:169 in 8e17009dcc outdated
    164+                const CTxOut& out = tx->vout[j];
    165+                COutPoint outpoint = COutPoint(tx->GetHash(), j);
    166+                Coin coin = Coin(out, pindex->nHeight, tx->IsCoinBase());
    167+
    168+                // Skip unspendable coins
    169+                if (coin.out.scriptPubKey.IsUnspendable()) continue;
    


    Sjors commented at 2:54 pm on April 26, 2020:
    Why do we skip unspendable coins here? Won’t that lead to a result inconsistent with gettxoutsetinfo?

    fjahr commented at 2:39 pm on May 15, 2020:
    No, I am only skipping coins here that are not part of the UTXO set. The same logic is used for evaluating if a coin is added to the UTXO set, see https://github.com/bitcoin/bitcoin/blob/951870807ea28e05cf074e364e1b55e985ab9f6d/src/coins.cpp#L70
  64. in src/index/coinstatsindex.cpp:171 in 8e17009dcc outdated
    166+                Coin coin = Coin(out, pindex->nHeight, tx->IsCoinBase());
    167+
    168+                // Skip unspendable coins
    169+                if (coin.out.scriptPubKey.IsUnspendable()) continue;
    170+
    171+                TruncatedSHA512Writer ss(SER_DISK, 0);
    


    Sjors commented at 2:54 pm on April 26, 2020:
    Ideally this code should be shared with ApplyStats()

    fjahr commented at 2:32 pm on May 15, 2020:
    done
  65. in src/index/coinstatsindex.cpp:179 in 8e17009dcc outdated
    178+                m_nTotalAmount += coin.out.nValue;
    179+                m_nBogoSize += 32 /* txid */ + 4 /* vout index */ + 4 /* height + coinbase */ + 8 /* amount */ +
    180+                               2 /* scriptPubKey len */ + coin.out.scriptPubKey.size() /* scriptPubKey */;
    181+            }
    182+
    183+            // The coinbase tx has no undo data since no former output is spent
    


    Sjors commented at 3:00 pm on April 26, 2020:
    I’m confused: the coinbase is added to the muhash accumulator.

    fjahr commented at 2:32 pm on May 15, 2020:
    Yes but the coinbase does not have any undo data because it is not spending any outputs. This section is only dealing with the undo data of the transactions in the block, it is not dealing with reorgs etc. which are further below.
  66. in src/index/coinstatsindex.cpp:349 in 8e17009dcc outdated
    352+            return error("%s: previous block header belongs to unexpected block %s; expected %s",
    353+                         __func__, read_out.first.ToString(), expected_block_hash.ToString());
    354+        }
    355+    }
    356+
    357+    MuHash3072 do_muhash;
    


    Sjors commented at 3:05 pm on April 26, 2020:
    Why is this called do_muhash rather than undo_muhash?

    fjahr commented at 2:29 pm on May 15, 2020:

    I came up with do_muhash because I needed the contrary term to undo_muhash. Since we are talking about rolling back a block that has ‘do’ (utxo created in the block) and ‘undo’ data (utxos destroyed in the block). I was thinking if I should reverse these terms when a block is removed or keep the terms the same as they were when the block was added. For me it seemed less confusing to keep the naming the same as when the block was added. So ‘do’ data that was added when the block is added still called ‘do’ data when the block is rolled back, just that the ‘do’ data is removed from the utxo set this time. Either way I probably confusing so I will keep adding comments to make this more clear.

    I should add that I only need these intermediate values because division is much slower than multiplication. Since these operations are cummutative I can multiply all the values that need to be removed from the muhash and then remove them all with just one division.

  67. in src/index/coinstatsindex.cpp:197 in 8e17009dcc outdated
    200+                                   2 /* scriptPubKey len */ + coin.out.scriptPubKey.size() /* scriptPubKey */;
    201+                }
    202+            }
    203+        }
    204+
    205+        m_muhash /= undo_muhash;
    


    Sjors commented at 3:06 pm on April 26, 2020:
    Why is this called undo_muhash and not do_muhash? And why is it divided?

    fjahr commented at 2:30 pm on May 15, 2020:
    It is the intermediate muhash for all the utxos that need to be removed/undone. See also my longer explanation for do_muhash.
  68. Sjors commented at 3:14 pm on April 26, 2020: member

    Concept ACK.

    Is there any reference document for MuHash3072? If so, it would be good to link to that in the header file. In particular, does the (single) test vector exist anywhere else?

    The mailinglist post you mentioned at the top also contains useful background.

    This particular part of the post might be worth copying (leaving only the MuHash part):

    Interestingly, both ECMH and MuHash not only support adding set elements in any order but also deleting in any order. As a result, we can simply maintain a running sum for the UTXO set as a whole, and add/subtract when creating/spending an output in it. In the case of MuHash it is slightly more complicated, as computing an inverse is relatively expensive. This can be solved by representing the running value as a fraction, and multiplying created elements into the numerator and spent elements into the denominator. Only when the final hash is desired, a single modular inverse and multiplication is needed to combine the two.

    As the update operations are also associative, H(a)+H(b)+H(c)+H(d) can in fact be computed as (H(a)+H(b)) + (H(c)+H(d)). This implies that all of this is perfectly parallellizable: each thread can process an arbitrary subset of the update operations, allowing them to be efficiently combined later.

    I can’t speak to the cryptographic assumptions underlying MuHash, but at the least within the scope of this PR it’s not used for anything security critical. As the mailinglist discussion pointed out, this feature is a good way to gain more experience with this type of hashing in a non consensus-critical manner.

    I also can’t say anything intelligent about the implementation, other than that it’s nice and short. Maybe @practicalswift can add a fuzz harness?

    b208dfd5712bdc658217c84d276d82e92cb2cf47 Add TruncatedSHA256Writer needs a test.

    I think you can safely squash f59d2223199d26ef591d9a3fc3282b19a4cb833a into the original and make yourself Co-Authored-By? You’re not changing the algorithm.

    I’m currently running cf6ba169ae1999a98db831900d36e0571a9d933d and building a mainnet index on macOS 10.15.4. I’ll update my comment with an example result from a recent block. I ran the first 100K blocks with --enable-debug, stopped and continued without debug.

    Concept ACK on dropping transaction count from gettxoutsetinfo. “The number of transactions with unspent outputs” seems a useless metric. If someone does have a use case for it, hopefully we’ll learn before the release. In that case it could be added to getblock.

    I got very confused by the indexer code in 8e17009dcc4de977ced030e624b18ecc6ba19e63, sorry for any dumb questions…

    If it’s not too big a code change, adding a height argument to gettxoutsetinfo would unleash the power of this index, and also make it easier to compare hashes between reviewers. It also makes it easier to manually compare against the hash generated without an index (with something a bit more complex than the functional test).

  69. DrahtBot added the label Needs rebase on May 8, 2020
  70. fjahr force-pushed on May 15, 2020
  71. Add TruncatedSHA256Writer e5c842146b
  72. MuHash3072 implementation
    Co-authored-by: Fabian Jahr <fjahr@protonmail.com>
    c46296006b
  73. Add x86_64 assembly optimization for MuHash df0a6e79ab
  74. crypto: Add serialization methods to Muhash 1a5928c393
  75. bench: Add seperate benchmarks for Muhash precompute, add and div 58426412f0
  76. coinstats: Remove tracking of transaction count in UTXO set
    Tracking this metric would be very complicated to implement when
    using a coinstatsindex and it does not seem to be used for anything
    meaningful.
    8e7c45da95
  77. coinstats: Use Muhash to calculate UTXO set hash dfba5c2bfa
  78. coinstats: Extract GetTruncatedSHA512Hash function
    This makes it possible to reuse this code in coinstatsindex.
    c092588cfa
  79. index: Add CoinStatsIndex
    The index holds the values previously calculated in coinstats.cpp
    for each block, representing the state of the UTXO set at each
    height.
    1502a08b4a
  80. index: CoinStatsIndex can be activated with command line flag c692275243
  81. test: Add functional test for UTXO Stats Index 92f4985753
  82. test: Add unit test for UTXO Stats Index 9d1b1ecf0f
  83. fjahr force-pushed on May 15, 2020
  84. DrahtBot removed the label Needs rebase on May 15, 2020
  85. rpc: gettxoutsetinfo can be requested for specific blockheights 58cf819caa
  86. fjahr force-pushed on May 15, 2020
  87. fjahr commented at 3:15 pm on May 15, 2020: member

    Thanks for the review and the great questions!

    Concept ACK.

    Is there any reference document for MuHash3072? If so, it would be good to link to that in the header file. In particular, does the (single) test vector exist anywhere else?

    I did a bit of research when I started the work on this but I am not aware of any reference document or official test vectors. Maybe @sipa knows if I missed something?

    The mailinglist post you mentioned at the top also contains useful background.

    This particular part of the post might be worth copying (leaving only the MuHash part):

    Interestingly, both ECMH and MuHash not only support adding set elements in any order but also deleting in any order. As a result, we can simply maintain a running sum for the UTXO set as a whole, and add/subtract when creating/spending an output in it. In the case of MuHash it is slightly more complicated, as computing an inverse is relatively expensive. This can be solved by representing the running value as a fraction, and multiplying created elements into the numerator and spent elements into the denominator. Only when the final hash is desired, a single modular inverse and multiplication is needed to combine the two.

    As the update operations are also associative, H(a)+H(b)+H(c)+H(d) can in fact be computed as (H(a)+H(b)) + (H(c)+H(d)). This implies that all of this is perfectly parallellizable: each thread can process an arbitrary subset of the update operations, allowing them to be efficiently combined later.

    Added an adapted version of this text to muhash.h.

    I can’t speak to the cryptographic assumptions underlying MuHash, but at the least within the scope of this PR it’s not used for anything security critical. As the mailinglist discussion pointed out, this feature is a good way to gain more experience with this type of hashing in a non consensus-critical manner.

    I also can’t say anything intelligent about the implementation, other than that it’s nice and short. Maybe @practicalswift can add a fuzz harness?

    Happy to work on a fuzz harness myself, I am trying to get some more experience with it :)

    b208dfd Add TruncatedSHA256Writer needs a test.

    I am currently a bit unsure of what the right way to go is for this. A first observation is that CHashWriter is also untested. I think both these classes are pretty thin wrappers of the actual hash functions they use, so a test might not provide much value. But TruncatedSHA256Writer does a bit more, so I think it could still be valuable to have a sanity check test. I was looking for any test vectors provided for SHA-512/256 and while there are is no official set, at least these could be enough for a sanity check. However, the problem is that for SHA-512/256 different initializers are used. We reuse SHA-512 for simplicity reasons, I guess. It’s easy enough to add the other initializers. I am just not sure it worth the additional review effort. Another fact that I stumbled over, is that the serializer used here is writing the null character of a string into the hash, where I am not sure if that is a bug but it certainly makes working with test vectors hard.

    So, I am not sure if adding SHA-512/256 is worth it. And I am also unsure if a test for TruncatedSHA256Writer would be valuable without test vectors and given the only limited logic inside the class.

    I would be happy to get input from @sipa here.

    I think you can safely squash f59d222 into the original and make yourself Co-Authored-By? You’re not changing the algorithm.

    done

    I’m currently running cf6ba16 and building a mainnet index on macOS 10.15.4. I’ll update my comment with an example result from a recent block. I ran the first 100K blocks with --enable-debug, stopped and continued without debug.

    Concept ACK on dropping transaction count from gettxoutsetinfo. “The number of transactions with unspent outputs” seems a useless metric. If someone does have a use case for it, hopefully we’ll learn before the release. In that case it could be added to getblock.

    I got very confused by the indexer code in 8e17009, sorry for any dumb questions…

    I will try to add more comments to make it easier to understand.

    If it’s not too big a code change, adding a height argument to gettxoutsetinfo would unleash the power of this index, and also make it easier to compare hashes between reviewers. It also makes it easier to manually compare against the hash generated without an index (with something a bit more complex than the functional test).

    Done, I had planned to do this as a follow-up anyway and with the removal of some duplicate code, the number of LOC has hardly grown.

  88. DrahtBot commented at 12:54 pm on May 26, 2020: member

    🐙 This pull request conflicts with the target branch and needs rebase.

  89. DrahtBot added the label Needs rebase on May 26, 2020
  90. fjahr renamed this:
    Index for UTXO Set Statistics
    [WIP] Index for UTXO Set Statistics
    on May 28, 2020
  91. laanwj removed this from the "Chasing Concept ACK" column in a project

  92. jonatack commented at 10:49 am on May 29, 2020: member

    Tested Concept and approach ACK 58cf819caabe98

    This PR builds cleanly. I launched bitcoind with -coinstatsindex=1 to create the index, which runs in a separate thread and does not block the node. It regularly logs progress in the debug log.

    The first 459k blocks took 19 hours (~400 blocks/minute) with debug build with many flags for reviewing.

    I rebuilt without debug enabled and no flags and restarted bitcoind. The remaining blocks 459k-632k (173k blocks) took 3.5 hours (~800 blocks/minute).

    The index was enabled when the debug log prints: “coinstatsindex is enabled at height 632059” and RPC gettxoutsetinfo then runs near-instantly.

  93. in src/rpc/blockchain.cpp:997 in 58cf819caa
     996                         {RPCResult::Type::STR_HEX, "bestblock", "The hash of the block at the tip of the chain"},
     997-                        {RPCResult::Type::NUM, "transactions", "The number of transactions with unspent outputs"},
     998                         {RPCResult::Type::NUM, "txouts", "The number of unspent transaction outputs"},
     999                         {RPCResult::Type::NUM, "bogosize", "A meaningless metric for UTXO set size"},
    1000-                        {RPCResult::Type::STR_HEX, "hash_serialized_2", "The serialized hash"},
    1001+                        {RPCResult::Type::STR_HEX, "utxo_set_hash", "The serialized hash"},
    


    Sjors commented at 11:01 am on June 9, 2020:

    Nit, for when you work on the RPC PR: let’s call this utxo_set_muhash in case we want to try another algo later.

    The hash_serialized_2 field should either be (instantly) deprecated, or remain in use when there’s no index. The documentation for -coinstatsindex should point this out.


    Sjors commented at 11:19 am on June 9, 2020:
    We could additionally expose the full 384-byte hash (base64 encoded). IIUC someone could use that to check if a specific UTXO is present. Update: that particular example may not be terribly useful
  94. in src/node/coinstats.cpp:39 in 58cf819caa
    44+uint256 GetTruncatedSHA512Hash(const COutPoint& outpoint, const Coin& coin) {
    45+    TruncatedSHA512Writer ss(SER_DISK, 0);
    46+    ss << outpoint;
    47+    ss << (uint32_t)(coin.nHeight * 2 + coin.fCoinBase);
    48+    ss << coin.out;
    49+    return ss.GetHash();
    


    Sjors commented at 12:19 pm on June 9, 2020:

    Let’s document what these fields are and why they’re useful:

    • outpoint: transaction hash + output position: uniquely identifies an unspent output (BIP 30)
    • height + coinbase: determines if it can be spent, e.g. given coinbase maturity and OP_CSV
    • out: amount + scriptPubKey

    If you have the above information and all the block headers, you can validate a new block, without needing data from previous blocks. This makes it suitable for pruning and UTXO snapshots.

  95. Sjors commented at 4:39 pm on June 9, 2020: member

    In #19145 (missing in OP) you’re adding a hash_type option to gettxoutsetinfo, which can be none. This makes an index immediately useful, even without MuHash. So you make a PR for that independently.

    Adding MuHash to the index later might be annoying, however if the plan is to support different hash types, it sounds like MuHash should be in a separate index from the coin stats anyway?

  96. fjahr commented at 12:57 pm on June 16, 2020: member

    In #19145 (missing in OP) you’re adding a hash_type option to gettxoutsetinfo, which can be none. This makes an index immediately useful, even without MuHash. So you make a PR for that independently.

    Adding MuHash to the index later might be annoying, however if the plan is to support different hash types, it sounds like MuHash should be in a separate index from the coin stats anyway?

    I needed to reflect a bit on this but now do I think it would be good to offer the index already without the hash. But I think to keep the hash in a different index and letting users set these with separate settings might be too much responsibility for the users. I would suggest a more opinionated approach. Rather than having -coinstatsindex=none and -coinstatsindex=muhash later I would suggest it stays -coinstatsindex and at first there would be no hash and users would only get a fast response when they use hash_type=none, then muhash can be added to the index and users will then be receiving responses from the index with hash_type=none and hash_type=muhash. Personally I think we don’t want to have a lot of hash types, too much unnecessary choice for the user. I think we use this now to support hash_serialized_2 for some time and maybe also to run ECMH in parallel but ideally we should be ending up at one hash_type that is the default and most of the other options can probably be removed.

    I will prepare the alternate PRs with only hash_type=none and I hope I can make so people are not totally confused by all the different PRs. :)

  97. MarcoFalke referenced this in commit b52e25cc1b on Jul 6, 2020
  98. sidhujag referenced this in commit e98a4e94d1 on Jul 8, 2020
  99. laanwj referenced this in commit 27eeb0337b on Aug 20, 2020
  100. sidhujag referenced this in commit defff93348 on Aug 20, 2020
  101. laanwj referenced this in commit 48c1083632 on Sep 1, 2020
  102. sidhujag referenced this in commit a97b9a0281 on Sep 3, 2020
  103. jnewbery commented at 10:49 am on October 14, 2020: member
    PR 19105 is merged
  104. laanwj referenced this in commit b6a71b80d2 on Jan 7, 2021
  105. sidhujag referenced this in commit d93f2b4943 on Jan 7, 2021
  106. laanwj referenced this in commit 8d82eddee6 on Feb 12, 2021
  107. sidhujag referenced this in commit 266fcc8b6d on Feb 12, 2021
  108. in src/index/coinstatsindex.cpp:104 in 58cf819caa
     99+CoinStatsIndex::CoinStatsIndex(size_t n_cache_size, bool f_memory, bool f_wipe)
    100+{
    101+    fs::path path = GetDataDir() / "indexes" / "coinstats";
    102+    fs::create_directories(path);
    103+
    104+    m_db = MakeUnique<CoinStatsIndex::DB>(path / "db", n_cache_size, f_memory, f_wipe);
    


    fanquake commented at 0:37 am on March 12, 2021:
  109. Rspigler commented at 2:44 am on April 19, 2021: contributor
    Chart can be updated as PR #19055 and #19145 have been merged
  110. fjahr commented at 10:33 pm on April 24, 2021: member

    Chart can be updated as PR #19055 and #19145 have been merged

    Thanks for the reminder, done!

  111. Rspigler commented at 7:23 am on April 25, 2021: contributor

    I don’t mean to spam the PR, but the chart was updated incorrectly.

    PR 19055 is merged but marked as open PR 19181 is open but marked as merged

    Thank you for all your work!

  112. fjahr commented at 10:20 am on April 25, 2021: member

    I don’t mean to spam the PR, but the chart was updated incorrectly.

    PR 19055 is merged but marked as open PR 19181 is open but marked as merged

    Thank you for all your work!

    Thanks, I was too hastily. Fixed.

  113. laanwj referenced this in commit 2b45cf0bcd on Apr 30, 2021
  114. Fabcien referenced this in commit 8ae513e386 on Aug 31, 2021
  115. DrahtBot commented at 11:22 am on December 15, 2021: member
    • Is it still relevant? ➡️ Please solve the conflicts to make it ready for review and to ensure the CI passes.
    • Is it no longer relevant? ➡️ Please close.
    • Did the author lose interest or time to work on this? ➡️ Please close it and mark it ‘Up for grabs’ with the label, so that it can be picked up in the future.
  116. Sjors commented at 12:16 pm on December 15, 2021: member
    @fjahr should this be closed? If #19181 is all that’s left, it can just be its own PR.
  117. fjahr commented at 4:51 pm on December 19, 2021: member

    Closing this as almost everything that was aimed for has been implemented by now. Thanks to everyone who supported this project with advice, code reviews, or just general words of encouragement!

    There are still two open PRs that are related and could use some review:

    • #19792 adds an RPC to export the coinstats to a CSV file
    • #19181 adds ASM optimizations (in rebase at the moment)
  118. fjahr closed this on Dec 19, 2021

  119. DrahtBot locked this on Dec 19, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-17 09:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me