rpc: Add sqlite format option for dumptxoutset

dunxen commented at 9:30 am on April 23, 2022: contributor

This allows RPC users of the dumptxoutset command to output to a SQLite DB format with human-readable fields. This approach allows for trivial export to JSON and CSV by using sqlite3 (detailed in #24628).

There is a new optional ‘format’ parameter for dumptxoutset that can take on the values of either ‘compact’ or ‘sqlite’ with ‘compact’ as the default. This allows for backward compatibility.

Closes #24628

dunxen commented at 9:33 am on April 23, 2022: contributor

Some basic timings for mainnet on my (admittedly not so amazing iMac), comparing to the existing compact format.

Time

Size

Roughly around twice as long and twice as big.

in src/rpc/blockchain.cpp:2252 in eaa4e63dac outdated

2281-{
2282-    const ArgsManager& args{EnsureAnyArgsman(request.context)};
2283-    const fs::path path = fsbridge::AbsPathJoin(args.GetDataDirNet(), fs::u8path(request.params[0].get_str()));
2284-    // Write to a temporary path and then move into `path` on completion
2285-    // to avoid confusion due to an interruption.
2286-    const fs::path temppath = fsbridge::AbsPathJoin(args.GetDataDirNet(), fs::u8path(request.params[0].get_str() + ".incomplete"));

dunxen commented at 9:36 am on April 23, 2022:

clang-format-diff.py formatted this and there are some changes. Not ideal for review, but let me know if I should remove the formatting.

prusnak commented at 11:32 am on April 24, 2022:

The review would be so much easier if you reverted the formatting changes.

dunxen commented at 6:59 pm on April 24, 2022:

Hmm, sorry. This comment missed my eye. I can revert this. Will push with the Win64 errors I figured out.

dunxen commented at 7:35 pm on April 24, 2022:

Reverted formatting. Should be much clearer now :)

DrahtBot added the label RPC/REST/ZMQ on Apr 23, 2022

DrahtBot commented at 4:21 am on April 24, 2022: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#25191 (build: Make --with-gui=qt6 configure option available on macOS by hebasto)
#24923 (Rework logging timer by MarcoFalke)
#24798 ([POC] build: Hello Qt 6 by hebasto)
#24410 ([kernel 2a/n] Split hashing/index GetUTXOStats codepaths, decouple from coinstatsindex by dongcarl)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

in src/rpc/blockchain.cpp:35 in eaa4e63dac outdated

30@@ -31,6 +31,7 @@
31 #include <rpc/server_util.h>
32 #include <rpc/util.h>
33 #include <script/descriptor.h>
34+#include <sqlite3.h>

0xB10C commented at 7:55 am on April 24, 2022:

I think this makes sqlite3 a required dependency for building Bitcoin Core, right?

Some CI tasks are failing because they don’t have sqlite3.h.

https://github.com/bitcoin/bitcoin/pull/24952/checks?check_run_id=6139350735

0rpc/blockchain.cpp:34:10: fatal error: 'sqlite3.h' file not found
1#include <sqlite3.h>
2         ^~~~~~~~~~~

dunxen commented at 7:58 am on April 24, 2022:

Ah, right! Will fix up

dunxen commented at 12:06 pm on April 24, 2022:

Forgot to handle the SQLITE_* constants too. Will fix up.

in src/rpc/blockchain.cpp:2397 in eaa4e63dac outdated

2413+        if (ret != SQLITE_OK) {
2414+            sqlite3_close(db);
2415+            throw JSONRPCError(RPC_INTERNAL_ERROR, strprintf("Unable to open database file %s", path.u8string()));
2416+        }
2417+
2418+        ret = sqlite3_exec(db, "PRAGMA journal_mode=wal2;", nullptr, nullptr, nullptr);

prusnak commented at 8:55 am on April 24, 2022:

Please add comment explaining why we want journal_mode set to wal2 and ideally add a link to SQLite documentation: https://www.sqlite.org/cgi/src/doc/wal2/doc/wal2.md

prusnak commented at 11:43 am on April 24, 2022:

Btw, isn’t journal_mode = off even better match for what we want? I think that not using journal might significantly improve write performance.

dunxen commented at 11:48 am on April 24, 2022:

Yeah, we don’t need/use rollbacks. It’s just an export so that makes sense.

prusnak commented at 6:52 pm on April 24, 2022:

I see this is now addressed in 84a6e28a8536fdc79796856e36be86a3bffe1848, thanks!

dunxen force-pushed on Apr 24, 2022

prusnak changes_requested

in src/rpc/blockchain.h:30 in 21ee90dda3 outdated

26@@ -27,6 +27,7 @@ struct NodeContext;
27 } // namespace node
28 
29 static constexpr int NUM_GETBLOCKSTATS_PERCENTILES = 5;
30+using coinsqlite_cb_t = std::function<std::string(const COutPoint&, const Coin&)>;

prusnak commented at 11:34 am on April 24, 2022:

Is this used anywhere?

dunxen commented at 11:36 am on April 24, 2022:

Ah, not anymore. Was going to have configurable fields like in the CSV PR, but decided against it otherwise things would get a little too complex here.

Will remove 👍

prusnak commented at 6:53 pm on April 24, 2022:

I see this is now addressed in 84a6e28a8536fdc79796856e36be86a3bffe1848, thanks!

in src/rpc/blockchain.h:57 in 21ee90dda3 outdated

53@@ -53,6 +54,7 @@ void CalculatePercentilesByWeight(CAmount result[NUM_GETBLOCKSTATS_PERCENTILES],
54  * @return a UniValue map containing metadata about the snapshot.
55  */
56 UniValue CreateUTXOSnapshot(
57+    const bool is_compact,

prusnak commented at 11:36 am on April 24, 2022:

maybe enum UTXOSnapshotFormat { FORMAT_COMPACT = 0, FORMAT_SQLITE = 1 }; instead of bool would be nicer and more explicit?

dunxen commented at 11:38 am on April 24, 2022:

Yes, I agree. I was going back and forth between them but I think explicit is better here. Would be easier to introduce another format in the future too. Will change 👍

prusnak commented at 6:53 pm on April 24, 2022:

I see this is now addressed in 84a6e28a8536fdc79796856e36be86a3bffe1848, thanks!

dunxen commented at 11:40 am on April 24, 2022: contributor

~~Some other failures related to string type conversions in Windows I’m not so familiar with here. Will have a closer look when I’m back home.~~

EDIT: Never mind. I can’t read…

dunxen force-pushed on Apr 24, 2022

in src/rpc/blockchain.cpp:2415 in 889f6c719d outdated

2429 
2430-    afile.fclose();
2431+            // We have no need for rollbacks here, so we can set the `journal_mode` to 'off'. See: https://www.sqlite.org/pragma.html#pragma_journal_mode
2432+            ret = sqlite3_exec(db, "PRAGMA journal_mode=off;", nullptr, nullptr, nullptr);
2433+            if (ret != SQLITE_OK) {
2434+                throw std::runtime_error(strprintf("SQLiteDatabase: Failed to set wal2: %s\n", sqlite3_errstr(ret)));

willcl-ark commented at 9:12 pm on April 24, 2022:

I think wal2 is a hangover from a previous version here

dunxen commented at 9:26 am on April 25, 2022:

Indeed, thanks!

in src/rpc/blockchain.cpp:2300 in 889f6c719d outdated

2315+    switch (format) {
2316+    case UTXOSnapshotFormat::COMPACT:
2317+        fs::rename(temppath, path);
2318+        break;
2319+    case UTXOSnapshotFormat::SQLITE:
2320+        fs::remove(temppath);

willcl-ark commented at 9:24 pm on April 24, 2022:

curious why we don’t use the temppath for sqlite like we do on compact?

w0xlt commented at 2:39 am on April 25, 2022:

temppath adds “.incomplete” to the end of the filename. path is the path entered by the user, so the database is created using this parameter.

temppath can eventually be removed from CreateUTXOSnapshot() as this parameter is used only once and for logging purposes. I think the same information can be obtained from afile as it is created from temppath.

dunxen commented at 9:25 am on April 25, 2022:

Yes, it’s not used in SQLite branch (nothing actually gets written to it), but it is somewhat tightly coupled, and since we need to do the USE_SQLITE check, it makes it a little difficult to completely decouple. So I just thought to remove it at the end. Acts as a decent indicator that nothing bad happened too (although redundant).

Maybe this can be refactored in the future if we don’t want it.

willcl-ark approved

willcl-ark commented at 9:38 pm on April 24, 2022: contributor

tACK 889f6c719dbd2bca5f38a6a5a89f39fb4639558f. Manually inspected a few random rows from the first 100 in the outputted db, which looked good. I left a few additional comments inline.

I also see twice the time and speed on my side:

w0xlt approved

w0xlt commented at 2:24 am on April 25, 2022: contributor

tACK 889f6c7

in src/rpc/blockchain.cpp:2276 in 889f6c719d outdated

2272@@ -2269,6 +2273,7 @@ static RPCHelpMan dumptxoutset()
2273         },
2274         RPCExamples{
2275             HelpExampleCli("dumptxoutset", "utxo.dat")
2276+    + HelpExampleCli("dumptxoutset", "\"utxo.db\" \"sqlite\"")

prusnak commented at 7:40 am on April 25, 2022:

Formatting nit-pick:

0            HelpExampleCli("dumptxoutset", "utxo.dat") +
1            HelpExampleCli("dumptxoutset", "\"utxo.db\" \"sqlite\"")

in src/rpc/blockchain.cpp:2370 in 889f6c719d outdated

2370+    switch (format) {
2371+    case UTXOSnapshotFormat::COMPACT:
2372+        {
2373+            LOG_TIME_SECONDS(strprintf("writing UTXO snapshot at height %s (%s) to file %s (via %s)",
2374+                                    tip->nHeight, tip->GetBlockHash().ToString(),
2375+                                    fs::PathToString(path), fs::PathToString(temppath)));

jonatack commented at 10:15 am on April 25, 2022:

indentation nit (or just run clang-format on the diff)

 0--- a/src/rpc/blockchain.cpp
 1+++ b/src/rpc/blockchain.cpp
 2@@ -2366,8 +2366,8 @@ UniValue CreateUTXOSnapshot(
 3     case UTXOSnapshotFormat::COMPACT:
 4         {
 5             LOG_TIME_SECONDS(strprintf("writing UTXO snapshot at height %s (%s) to file %s (via %s)",
 6-                                    tip->nHeight, tip->GetBlockHash().ToString(),
 7-                                    fs::PathToString(path), fs::PathToString(temppath)));
 8+                                       tip->nHeight, tip->GetBlockHash().ToString(),
 9+                                       fs::PathToString(path), fs::PathToString(temppath)));
10             SnapshotMetadata metadata{tip->GetBlockHash(), stats.coins_count, tip->nChainTx};

in src/rpc/blockchain.cpp:2397 in 889f6c719d outdated

2408         }
2409+    case UTXOSnapshotFormat::SQLITE:
2410+        {
2411+#ifdef USE_SQLITE
2412+            LOG_TIME_SECONDS(strprintf("writing UTXO snapshot at height %s (%s) to sqlite DB file %s",
2413+                                    tip->nHeight, tip->GetBlockHash().ToString(), fs::PathToString(path)));

jonatack commented at 10:16 am on April 25, 2022:

indentation nit

0                                       tip->nHeight, tip->GetBlockHash().ToString(), fs::PathToString(path)));

in src/rpc/blockchain.cpp:2366 in 889f6c719d outdated

2366-        tip->nHeight, tip->GetBlockHash().ToString(),
2367-        fs::PathToString(path), fs::PathToString(temppath)));
2368-
2369-    SnapshotMetadata metadata{tip->GetBlockHash(), stats.coins_count, tip->nChainTx};
2370+    switch (format) {
2371+    case UTXOSnapshotFormat::COMPACT:

jonatack commented at 10:20 am on April 25, 2022:

The code blocks below each case are indented 4 spaces too deeply (don’t hesitate to run clang-format on the diff).

0    case UTXOSnapshotFormat::COMPACT: {

dunxen commented at 10:53 am on April 25, 2022:

Thanks. I had removed a formatting change to make review a bit easier and then was not running it on subsequent changes.

dunxen force-pushed on Apr 25, 2022

in test/functional/rpc_dumptxoutset.py:102 in 2d95dbb489 outdated

87+    def run_test(self):
88+        """Test trivial usage of the dumptxoutset RPC command."""
89+        self.test_compact_format()
90+        if self.is_sqlite_compiled():
91+            self.test_sqlite_format()
92+

jonatack commented at 10:29 am on April 25, 2022:

suggestions

 0+++ b/test/functional/rpc_dumptxoutset.py
 1@@ -21,6 +21,7 @@ class DumptxoutsetTest(BitcoinTestFramework):
 2         self.num_nodes = 1
 3 
 4     def test_compact_format(self):
 5+        self.log.info("Test compact format")
 6         node = self.nodes[0]
 7@@ -55,8 +56,8 @@ class DumptxoutsetTest(BitcoinTestFramework):
 8 
 9     def test_sqlite_format(self):
10+        self.log.info("Test sqlite format")
11         node = self.nodes[0]
12-
13         FILENAME = 'txoutset.db'
14@@ -83,10 +84,10 @@ class DumptxoutsetTest(BitcoinTestFramework):
15     def run_test(self):
16-        """Test trivial usage of the dumptxoutset RPC command."""
17         self.test_compact_format()
18         if self.is_sqlite_compiled():
19             self.test_sqlite_format()
20 
21+
22 if __name__ == '__main__':
23     DumptxoutsetTest().main()

in src/rpc/blockchain.cpp:2261 in 2d95dbb489 outdated

2257@@ -2255,6 +2258,7 @@ static RPCHelpMan dumptxoutset()
2258         "Write the serialized UTXO set to disk.",
2259         {
2260             {"path", RPCArg::Type::STR, RPCArg::Optional::NO, "Path to the output file. If relative, will be prefixed by datadir."},
2261+            {"format", RPCArg::Type::STR, RPCArg::Default{"compact"}, "'compact' to output a compact binary serialized format, 'sqlite' to output a SQLite database with table 'txoutset' (requires sqlite3)", "format"},

jonatack commented at 10:34 am on April 25, 2022:

This outputs the following:

02. format    (string, optional, default="compact") 'compact' to output a compact binary serialized format, 'sqlite' to output a SQLite database with table 'txoutset' (requires sqlite3)

Maybe use quotes the same as the default.

0            {"format", RPCArg::Type::STR, RPCArg::Default{"compact"}, "\"compact\" to output a compact binary serialized format, \"sqlite\" to output a SQLite database with table \"txoutset\" (requires sqlite3)", "format"},

jonatack commented at 10:35 am on April 25, 2022: contributor

Approach ACK

dunxen force-pushed on Apr 25, 2022

theStack commented at 3:51 pm on April 25, 2022: contributor

Concept ACK

I first thought creating an external tool to convert from compact-serialized format to SQLite was a viable alternative to implement this directly in bitcoind, but it turned out to be bad idea; one of the reasons is that the decompression logic has to be implemented again, which in one part even needs secp256k1 (https://github.com/bitcoin/bitcoin/issues/24628#issuecomment-1108469715).

Some initial thoughts:

Is there any reason why we store prevoutHash and scriptPubKey with type TEXT rather than a BLOB? Using the latter should reduce the size of the resulting database by almost 2x, and SQLite supports select hex(column) in case the hex representation is needed by the user
Not sure if it helps in this scenario or is worthy to investigate, but in some cases I’ve made good experiences by tuning the cache_size parameter (https://www.sqlite.org/pragma.html#pragma_cache_size) in order to have less frequent writes (the default is 2 MB)

prusnak commented at 4:00 pm on April 25, 2022: contributor

Is there any reason why we store prevoutHash and scriptPubKey with type TEXT rather than a BLOB?

One of the goals of this change is to enable trivial creation of JSON/CSV files with the data. This can be achieved simply by using one of the following:

sqlite3 -header -csv utxo.db "SELECT * FROM utxos" > utxo.csv
sqlite3 -json utxo.db "SELECT * FROM utxos" > utxo.json

If BLOB was used instead of hex stored in TEXT, this conversion step could not have been done trivially as one would have to specify a more complex SQL query such as:

SELECT hex(txid), vout, value, coinbase, height, hex(scriptpubkey) FROM utxos

However, it might still be worth a shot to require more complicated SQL query if we really gain 2x the performance and decrease the disk space to 50%.

dunxen commented at 4:06 pm on April 25, 2022: contributor

Not sure if it helps in this scenario or is worthy to investigate, but in some cases I’ve made good experiences by tuning the cache_size parameter (https://www.sqlite.org/pragma.html#pragma_cache_size) in order to have less frequent writes (the default is 2 MB)

Thanks! I’m happy to investigate tuning this. We’re currently roughly getting it to be double the time of a compact dump.

prusnak commented at 4:09 pm on April 25, 2022: contributor

We’re currently roughly getting it to be double the time of a compact dump.

I think you can’t get much better than this. You are writing approx 2x the data (because of hex representation of txid and scriptpubkey and my guess is that the whole operation is I/O bound and not CPU bound.

IMO the only way how to make the process fast is to use BLOB type as suggested by @theStack above at the expense of requiring more complex query when working with the data.

dunxen commented at 4:10 pm on April 25, 2022: contributor

However, it might still be worth a shot to require more complicated SQL query if we really gain 2x the performance and decrease the disk space to 50%.

I think that’s a fair trade-off. Also should not be all that more complicated of a query.

theStack commented at 4:34 pm on April 25, 2022: contributor

@prusnak: Thanks for elaborating, easy conversion to CSV/JSON etc. makes sense!

One thing that has to be taken care of is the endianness of the txid hash. Unfortunately Bitcoin displays (and accepts, in case of user input for RPCs) txids and block hashes as little endian, but they are stored internally as big endian: https://bitcoin.stackexchange.com/a/2069 That means, if we decide to store in TEXT, then the txid has already the typical form as we know them from wallets, block explorers, debug outputs etc. (hash.GetHex() takes care of that), but on the other hand if we want to use that prevout for e.g. constructing a transaction in a raw byte-by-byte way, then it would need to be reversed first. If we store it as a BLOB though, it has the internal big endian format, but hex(txid) doesn’t show it in the right way, as it would need to be reversed first. Unfortunately, there is no built-in “reverse” operator in SQLite :/ If there was, we could simply do something like the following for the conversion script: SELECT hex(reverse(txid)), vout, value, coinbase, height, hex(scriptpubkey) FROM utxos

(For the scriptPubKey, there is no such endianness problem as it is simply a sequence of bytes)

So as far as I can see, we have to decide between either easy-conversion or ~2x performance / 50% size (though the latter is still just an assumption). One possible idea would be to store the BLOB already reversed, but this is kind of a something in-between solution that possibly is even more confusing for users.

brunoerg commented at 0:21 am on April 26, 2022: contributor

Concept ACK

in src/rpc/blockchain.cpp:2390 in 2d95f24c35 outdated

2418+        }
2419+
2420+        ret = sqlite3_open(fs::PathToString(path).c_str(), &db);
2421+        if (ret != SQLITE_OK) {
2422+            sqlite3_close(db);
2423+            throw JSONRPCError(RPC_INTERNAL_ERROR, strprintf("Unable to open database file %s", path.u8string()));

brunoerg commented at 0:46 am on April 26, 2022:

maybe we could cover this error in the functional test, passing a wrong path.

dunxen commented at 8:27 pm on April 26, 2022:

Thanks. Have added one!

in src/rpc/blockchain.h:53 in 2d95f24c35 outdated

47@@ -48,11 +48,18 @@ UniValue blockheaderToJSON(const CBlockIndex* tip, const CBlockIndex* blockindex
48 /** Used by getblockstats to get feerates at different percentiles by weight  */
49 void CalculatePercentilesByWeight(CAmount result[NUM_GETBLOCKSTATS_PERCENTILES], std::vector<std::pair<CAmount, int64_t>>& scores, int64_t total_weight);
50 
51+/** File format type for CreateUTXOSnapshot() */
52+enum class UTXOSnapshotFormat {
53+    COMPACT,

w0xlt commented at 0:48 am on April 26, 2022:

nit: is there a reason to call the original format COMPACT ? Something like BINARY might be more accurate.

0    BINARY,

dunxen commented at 8:24 am on April 26, 2022:

No fundamental reason. I just used nomenclature from the CSV attempt: #18689. Happy to change if we think it’s clearer and people have strong opinions about it.

prusnak commented at 1:12 pm on April 26, 2022:

I like compact more than binary.

w0xlt approved

w0xlt commented at 0:49 am on April 26, 2022: contributor

re-ACK https://github.com/bitcoin/bitcoin/pull/24952/commits/2d95f24c358cfea7bd7cb11c14e26954560c651d

dunxen commented at 7:33 am on April 26, 2022: contributor

So I did try with blobs and managed to get the db down from 11 GB to 7 GB (about 2 GB more than the compact format). There was only really around a 15% increase in performance for me (but I possibly have other bottlenecks on my machine).

theStack commented at 11:52 am on April 26, 2022: contributor

So I did try with blobs and managed to get the db down from 11 GB to 7 GB (about 2 GB more than the compact format). There was only really around a 15% increase in performance for me (but I possibly have other bottlenecks on my machine).

Thanks for testing. Considering these results, it seems more reasonable to store the txid-hashes as text (like it is done currently in the PR), in order to have an easy conversion. Two more thoughts:

The compact-serialized output contains metadata, namely the blockhash of the snapshot and the number of UTXOs. Should we also at least include the blockhash (possibly also block height)? I think the SQLite output shouldn’t contain less information than the old format. (Storing the number of UTXOs seems unnecessary though, as it can simply be determined by select count(*) from utxos).
Do we want to add indices in the course of the conversion, or is that left to the user? Right now, the potential of having the UTXOs in a database is not fully unleashed, since it would still need to linearly search through all ~80 million rows. An index on nValue, nHeight and scriptPubKey probably would make sense, in order to quickly yield results e.g. for UTXOs with a certain amount or block range. Personally I don’t have experience with indices on the TEXT type, but I believe this would help for quickly getting UTXOs with certain output types, e.g. one could query all P2WPKH coins by searching on scriptPubKeys with a prefix of 0014.

prusnak commented at 1:18 pm on April 26, 2022: contributor

The compact-serialized output contains metadata,

We can create another table called metadata with blockhash and blockheight columns.

Do we want to add indices in the course of the conversion

I don’t think we want to push indices on all users that want this feature. They can still create indices afterwards via sqlite3 utxo.db "CREATE INDEX idx_script ON utxo(scriptpubkey)"

dunxen commented at 1:45 pm on April 26, 2022: contributor

I don’t think we want to push indices on all users that want this feature. They can still create indices afterwards via sqlite3 utxo.db "CREATE INDEX idx_script ON utxo(scriptpubkey)"

I can maybe check the extra resource requirements and we can have a clearer picture on whether it would be worth it? The one thing is that without any indices then a simple SELECT COUNT(*) FROM utxos takes a while. Of course, it is a ROWID table so doing SELECT MAX(rowid) FROM utxos is constant time and works out the box.

Otherwise, happy to keep indices out of it. I’ll create the metadata table in my next push.

dunxen commented at 8:13 pm on April 26, 2022: contributor

I can maybe check the extra resource requirements and we can have a clearer picture on whether it would be worth it? The one thing is that without any indices then a simple SELECT COUNT(*) FROM utxos takes a while. Of course, it is a ROWID table so doing SELECT MAX(rowid) FROM utxos is constant time and works out the box.

So dumping with indices will take significantly longer. On my iMac it takes longer than the default rpcclienttimeout (so I have to disable timeout). The index for the height column alone adds about 1 GB to the overall size (but this is not really the main issue).

I think it’s fair to leave it to the user to create the indices they are most interested in. For now I think I’ll just push the metadata table and test changes and then we can decide from there.

dunxen force-pushed on Apr 26, 2022

in src/rpc/blockchain.cpp:2449 in 39c392af35 outdated

2461+
2462+        while (pcursor->Valid()) {
2463+            if (iter % 5000 == 0) node.rpc_interruption_point();
2464+            ++iter;
2465+            if (pcursor->GetKey(key) && pcursor->GetValue(coin)) {
2466+                sqlite3_bind_text(stmt, 1, key.hash.GetHex().c_str(), strlen(key.hash.GetHex().c_str()), SQLITE_TRANSIENT);

theStack commented at 11:42 am on April 28, 2022:

Rather than calling uint256.GetHex() twice per iteration and determining its size via strlen, could create the string once and use its .size() method:

0                std::string prevouthash_hex{key.hash.GetHex()};
1                sqlite3_bind_text(stmt, 1, prevouthash_hex.c_str(), prevouthash_hex.size(), SQLITE_TRANSIENT);

In theory one could even pass the length directly as constant (64) since GetHex() results always have that size (then the string variable is not even needed).

dunxen commented at 4:13 pm on April 28, 2022:

I think I’ll stick to the creating the string as you mentioned, just so it’s clearer.

in src/rpc/blockchain.cpp:2454 in 39c392af35 outdated

2466+                sqlite3_bind_text(stmt, 1, key.hash.GetHex().c_str(), strlen(key.hash.GetHex().c_str()), SQLITE_TRANSIENT);
2467+                sqlite3_bind_int(stmt, 2, key.n);
2468+                sqlite3_bind_int64(stmt, 3, coin.out.nValue);
2469+                sqlite3_bind_int(stmt, 4, coin.fCoinBase);
2470+                sqlite3_bind_int(stmt, 5, coin.nHeight);
2471+                sqlite3_bind_text(stmt, 6, HexStr(coin.out.scriptPubKey).c_str(), strlen(HexStr(coin.out.scriptPubKey).c_str()), SQLITE_TRANSIENT);

theStack commented at 11:47 am on April 28, 2022:

Similar to the above suggestion:

0                std::string scriptpubkey_hex{HexStr(coin.out.scriptPubKey)};
1                sqlite3_bind_text(stmt, 6, scriptpubkey_hex.c_str(), scriptpubkey_hex.size(), SQLITE_TRANSIENT);

theStack commented at 12:04 pm on April 28, 2022: contributor

I don’t think we want to push indices on all users that want this feature. They can still create indices afterwards via sqlite3 utxo.db “CREATE INDEX idx_script ON utxo(scriptpubkey)”

Agreed, seems to be the right decision to not create any indices and leave that up to the user, considering the significant performance loss for dumping.

Did some testing and everything seems to work fine so far. Creating the SQLite format takes a little more than 2x longer than the compact-serialized format on my machine, but that’s not really surprising and actually quite okay I would say, considering one gets a nice database format.

 0$ time ./src/bitcoin-cli dumptxoutset utxo.dat
 1{
 2  "coins_written": 81976086,
 3  "base_hash": "00000000000000000004f5b8fa74921f24ac154e2725f7cdc8b0542568064308",
 4  "base_height": 733945,
 5  "path": "/home/honeybadger/.bitcoin/utxo.dat",
 6  "txoutset_hash": "5b6936caddd3fa8e5e9ef55843c4d8cf5069ad3e56e415f963776f06fdc097c7",
 7  "nchaintx": 729190514
 8}
 9
10real    4m13.964s
11user    0m0.001s
12sys     0m0.004s

 0$ time ./src/bitcoin-cli dumptxoutset utxo.sqlite sqlite
 1{
 2  "coins_written": 81976086,
 3  "base_hash": "00000000000000000004f5b8fa74921f24ac154e2725f7cdc8b0542568064308",
 4  "base_height": 733945,
 5  "path": "/home/honeybadger/.bitcoin/utxo.sqlite",
 6  "txoutset_hash": "5b6936caddd3fa8e5e9ef55843c4d8cf5069ad3e56e415f963776f06fdc097c7",
 7  "nchaintx": 729190514
 8}
 9
10real    9m17.225s
11user    0m0.003s
12sys     0m0.002s

Some playing around with the database in the SQLite interpreter. Of the few queries I tried, only select(*) from utxos took terribly long (actually longer than the time needed to create the database!), so select max(rowid) from utxos definitely is the better alternative here (works only as long as no rows have been deleted though):

 0$ sqlite3 ~/.bitcoin/utxo.sqlite
 1SQLite version 3.38.2 2022-03-26 13:51:10
 2Enter ".help" for usage hints.
 3sqlite> .headers on
 4sqlite> .mode column
 5sqlite> .timer on
 6sqlite>
 7sqlite> select * from metadata;
 8blockhash                                                     blockheight
 9------------------------------------------------------------  -----------
1000000000000000000004f5b8fa74921f24ac154e2725f7cdc8b054256806  733945
114308
12Run Time: real 0.000 user 0.000483 sys 0.000000
13
14sqlite> select count(*) from utxos;
15count(*)
16--------
1781976086
18Run Time: real 721.525 user 2.416016 sys 71.367933
19
20sqlite> .schema
21CREATE TABLE metadata(blockhash TEXT, blockheight INT);
22CREATE TABLE utxos(txid TEXT, vout INT, value INT, coinbase INT, height INT, scriptpubkey TEXT);
23
24
25sqlite> create index value_idx on utxos(value);
26Run Time: real 242.699 user 84.718367 sys 106.091081
27
28sqlite> select sum(value) from utxos;
29sum(value)
30----------------
311902444842955153
32
33sqlite> select value, scriptpubkey from utxos where value = 666666666;
34value      scriptpubkey
35---------  --------------------------------------------------
36666666666  76a9149adfb7afb0d1727c30bd774b4c2a3bd30eda402b88ac
37666666666  0014874270f22e6e68c5f1bfdac7a1d2b10c55766b0d
38666666666  76a914292797f0e2e6d80902f34ec4743a31c0fb020dde88ac
39Run Time: real 0.002 user 0.000396 sys 0.000000
40
41sqlite> select count(*) from utxos where value = 100000000;
42count(*)
43--------
44123871
45Run Time: real 0.007 user 0.004100 sys 0.002324
46
47sqlite> select max(rowid) from utxos;
48max(rowid)
49----------
5081976086
51Run Time: real 0.002 user 0.000221 sys 0.000000
52
53sqlite> select height, count(*) as c from utxos group by height order by c desc limit 10;
54height  c
55------  -----
56365056  18151
57506320  18017
58364809  17668
59365083  17636
60475099  17629
61364853  17558
62364882  17547
63364962  17221
64364798  16942
65364858  16835
66Run Time: real 7.230 user 6.362504 sys 0.577103

TIL that currently most of the entries in the UTXO set originated in block 365056 :)

Left two review comments below. I think this feature is nice enough to also deserve a small release note?

dunxen commented at 4:36 pm on April 28, 2022: contributor

so select max(rowid) from utxos definitely is the better alternative here (works only as long as no rows have been deleted though)

Yes, the one caveat with that! Thanks for running a few nice examples :D

dunxen force-pushed on Apr 28, 2022

laanwj commented at 5:56 pm on April 28, 2022: member

Concept ACK. I was slightly surprised at first to see sqlite used as a format outside the wallet, but I think it makes sense, it’s a flexible and efficient data format, all languages worth their salt have bindings for it, without need for writing custom parsers. We also don’t need to introduce a new dependency which we’d have to for another compact format like BSON.

What I wonder, does this need special handling in the build system? E.g. would this need some use-sqlite-but-no-wallet flag?

Agreed, seems to be the right decision to not create any indices and leave that up to the user, considering the significant performance loss for dumping.

I agree. Not adding indices makes the RPC call less complex, and if you know when you’re done putting data in a database, I would think it’s more efficient to create the indices afterward (processing the data in one go) anyway.

dunxen commented at 6:08 pm on April 28, 2022: contributor

What I wonder, does this need special handling in the build system? E.g. would this need some use-sqlite-but-no-wallet flag?

Ah, yes. This is a good point! --with-sqlite’s auto option just checks for wallet being enabled and sqlite being available. It might be weird for this functionality to work if wallet is enabled and then not working if it’s disabled if --with-sqlite=yes is not passed.

dunxen commented at 6:12 pm on April 28, 2022: contributor

Also see I need to rebase and update ValidAsCString(): https://github.com/bitcoin/bitcoin/commit/fa7078d84fc2858a466bc1a85404f821df682538

dunxen force-pushed on Apr 28, 2022

Sjors commented at 9:43 am on April 29, 2022: member

@laanwj wrote:

What I wonder, does this need special handling in the build system? E.g. would this need some use-sqlite-but-no-wallet flag?

I think we just need to document that this feature only works when compiled with the wallet or configured with Sqlite support. Inside configure.ac it would make sense to move use_sqlite outside of the wallet conditional:

0if test "$enable_wallet" != "no"; then
1    echo "    with sqlite   = $use_sqlite"
2    echo "    with bdb      = $use_bdb"
3fi

theStack referenced this in commit bc23bd133f on May 1, 2022

dunxen force-pushed on May 2, 2022

willcl-ark commented at 10:02 pm on May 5, 2022: contributor

reACK git range-diff 889f6c719dbd2bca5f38a6a5a89f39fb4639558f...8fe0e7c48e651fdf981675d0dd1542ec49ea91b7

in configure.ac:2019 in 8fe0e7c48e outdated

1998@@ -1999,9 +1999,9 @@ echo "  with experimental syscall sandbox support = $use_syscall_sandbox"
1999 echo "  with libs       = $build_bitcoin_libs"
2000 echo "  with wallet     = $enable_wallet"
2001 if test "$enable_wallet" != "no"; then
2002-    echo "    with sqlite   = $use_sqlite"
2003     echo "    with bdb      = $use_bdb"
2004 fi
2005+echo "  with sqlite     = $use_sqlite"

luke-jr commented at 10:04 pm on May 6, 2022:

Also need to actually check for and enable sqlite without wallet build

in src/rpc/blockchain.cpp:2301 in 8fe0e7c48e outdated

2297     if (fs::exists(path)) {
2298         throw JSONRPCError(
2299             RPC_INVALID_PARAMETER,
2300             path.u8string() + " already exists. If you are sure this is what you want, "
2301-            "move it out of the way first");
2302+                                "move it out of the way first");

luke-jr commented at 10:05 pm on May 6, 2022:

wrong indentation

in src/rpc/blockchain.cpp:2411 in 8fe0e7c48e outdated

2439+        if (ret != SQLITE_OK) {
2440+            sqlite3_close(db);
2441+            throw JSONRPCError(RPC_INTERNAL_ERROR, strprintf("SQLiteDatabase: Failed to insert into 'metadata' table: %s\n", sqlite3_errstr(ret)));
2442+        }
2443+
2444+        const char* sql = "CREATE TABLE utxos(txid TEXT, vout INT, value INT, coinbase INT, height INT, scriptpubkey TEXT);";

luke-jr commented at 10:13 pm on May 6, 2022:

vout is the wrong term

luke-jr commented at 10:15 pm on May 6, 2022:

At least scriptpubkey should probably be a binary BLOB?

theStack commented at 1:18 pm on May 7, 2022:

What would be a better alternative? Simply n? Right now vout is already used in quite some RPC calls to refer to an outpoint’s index (e.g. in the resut of scantxoutset).

theStack commented at 1:23 pm on May 7, 2022:

One drawback with changing it to a BLOB is that the simply query select * from utxos doesn’t yield a readable result anymore (see also #24952 (comment)). On the other hand the dumps would be a bit smaller 🤔

luke-jr changes_requested

fanquake commented at 10:59 am on May 7, 2022: member

What I wonder, does this need special handling in the build system? E.g. would this need some use-sqlite-but-no-wallet flag?

Yes I think it does. However this change is done, it should not be adding an auto-detected/enabled by default dependency on sqlite, for a “nice to have” RPC feature.

theStack approved

theStack commented at 3:00 pm on May 10, 2022: contributor

Lightly tested ACK 8fe0e7c48e651fdf981675d0dd1542ec49ea91b7 🗄️ (modulo needed build system changes w.r.t. sqlite)

In the course of researching possible alternatives for this PR I created a tool for converting an UTXO set dump from compact-serialized format to SQLite format (see https://github.com/theStack/utxo_dump_tools). Of course the tool will be obsolete after this PR is merged, but having it available enables a first step to lightly test the integrity of the created data (and also the correctness of conversion tool) for now by

dumping UTXO set in legacy format -> convert it to SQLite
dumping UTXO set directly in SQLite format (as introduced in this PR)
comparing the contents of these two databases

The SQLite files are not equal on a byte-by-byte basis after dropping the metadata table (I guess there is possibly some internal metadata for SQLite as well that could differ), but their dumped data in e.g. CSV format is exactly equal.

Note that since dozens of gigabytes are converted and processed several times in this testing scenario, it takes quite a while, it was about ~2 hours in total on my machine (on testnet or signet it should be significantly faster, of course). If anyone wants to follow the steps:

 0<run bitcoind of [#24952](/bitcoin-bitcoin/24952/)>
 1$ bitcoin-cli dumptxoutset utxos.compact compact
 2$ bitcoin-cli dumptxoutset utxos.sqlite sqlite
 3(make sure that both dumptxoutset calls happen at the same block height!)
 4$ git clone https://github.com/theStack/utxo_dump_tools
 5$ cd utxo_dump_tools/utxo_to_sqlite
 6$ go run utxo_to_sqlite ~/.bitcoin/utxos.compact ~/.bitcoin/utxos_from_compact.sqlite
 7$ sqlite3 -header -csv ~/.bitcoin/utxos.sqlite "SELECT * FROM utxos" > utxos.csv
 8$ sqlite3 -header -csv ~/.bitcoin/utxos_from_compact.sqlite "SELECT * FROM utxos" > utxos_from_compact.csv
 9$ sha256sum utxos.csv utxos_from_compact.csv
10< hashes of both files should match >

For mainnet, block height 735763 (00000000000000000000104512ae0ed032068ee654c328b524b6d1a743472509):

0$ sha256sum *.csv
120dd68ae4986735692e9109b49572f08240bd8637d663008e6ad7924df4b8bac  utxos.csv
220dd68ae4986735692e9109b49572f08240bd8637d663008e6ad7924df4b8bac  utxos_from_compact.csv

Even nicer would be the possibility to calculate the UTXO set hash of a SQLite database (which can be easily compared to the output of a call to the gettxoutsetinfo RPC), I’m working on a verification tool implementation using MuHash which should be ready soon.

dunxen commented at 3:24 pm on May 10, 2022: contributor

(modulo needed build system changes w.r.t. sqlite)

Working on that today. Apologies for the delay!

In the course of researching possible alternatives for this PR I created a tool for converting an UTXO set dump from compact-serialized format to SQLite format (see https://github.com/theStack/utxo_dump_tools). Of course the tool will be obsolete after this PR is merged, but having it available enables a first step to lightly test the integrity of the created data (and also the correctness of conversion tool) for now by

Thank you! This is pretty helpful for testing.

The SQLite files are not equal on a byte-by-byte basis after dropping the metadata table (I guess there is possibly some internal metadata for SQLite as well that could differ), but their dumped data in e.g. CSV format is exactly equal.

Yeah, I’d imagine something of the sort w.r.t. the actual SQLite file metadata.

Even nicer would be the possibility to calculate the UTXO set hash of a SQLite database (which can be easily compared to the output of a call to the gettxoutsetinfo RPC), I’m working on a verification tool implementation using MuHash which should be ready soon.

Would definitely be keen to try it out when ready!

fanquake commented at 4:22 pm on May 10, 2022: member

Of course the tool will be obsolete after this PR is merged,

Alternatively, the existence of the tool may make the PR obsolete. If it’s possible for Core to just maintain a single, compact, dependency-less output format, and it’s easy for devs or power users to write & use tools that can convert that compact format into whichever other formats they need, that seems superior to introducing more code (to maintain) into Core, and a new non-wallet sqlite dependency.

theStack commented at 9:43 pm on May 10, 2022: contributor

Of course the tool will be obsolete after this PR is merged,

Alternatively, the existence of the tool may make the PR obsolete. If it’s possible for Core to just maintain a single, compact, dependency-less output format, and it’s easy for devs or power users to write & use tools that can convert that compact format into whichever other formats they need, that seems superior to introducing more code (to maintain) into Core, and a new non-wallet sqlite dependency.

That’s a valid counter-argument for sure and was also the initial reason I started writing this tool, as I wasn’t convinced whether adding a second dump format with a dependency is really worth the additional code and maintenance burden (see #24628 (comment) for more details). The grave drawback of the current UTXO dump format though is that it’s solely designed for being compact but nothing else, i.e. it’s not useful for inspection at all and unusable without further processing; too many fields are compressed in a non-trivial way. As the worst example, you even need elliptic curve math (modular field arithmetics on large integers, basically a subset of the secp256k1 library) to restore the scriptPubKeys for P2PK outputs, so it is unfortunately not really “dependency-less”.

A good dump format should IMHO contain only the pure data without trying to encode things in a smart non-standardized way that forces users to process it further to do even basic inspection. Since handling those huge amounts of data (>80 million rows on mainnet, currently) barely makes sense without having some basic form of database, using a very widespread and stable single-file format like SQLite seems still like a good choice to me.

pk-b2 commented at 10:34 pm on May 10, 2022: none

Is it worth to consider to just write it to stdout while iterating all utxos?

That would make it easy for a consumer to bring it into any required format for further processing or analyzing without having Core to make that decision.

sqlite3 is able to capture from stdin as an import source and most other formats like csv or json can already be processed via the command line. You could even already apply filtering while dumping all records.

fanquake commented at 7:45 am on May 11, 2022: member

The grave drawback of the current UTXO dump format though is that it’s solely designed for being compact but nothing else, i.e. it’s not useful for inspection at all and unusable without further processing; too many fields are compressed in a non-trivial way.

It really sounds like we should be improving the current format (I don’t think we need to worry about backwards compatibility too much), given the issues you’ve outlined, rather than adding a new format, more code, and a dependency we don’t want. Otherwise we’ve still got to maintain the old format, which is compressed (convenient), but not really useful unless you write your own parser (clearly doable), and also maintain the new format, which is seemingly only being introduced because the current format isn’t really usable for what everyone wants.

dunxen force-pushed on May 11, 2022

theStack commented at 11:09 am on May 11, 2022: contributor

It really sounds like we should be improving the current format (I don’t think we need to worry about backwards compatibility too much), given the issues you’ve outlined, rather than adding a new format, more code, and a dependency we don’t want. Otherwise we’ve still got to maintain the old format, which is compressed (convenient), but not really useful unless you write your own parser (clearly doable), and also maintain the new format, which is seemingly only being introduced because the current format isn’t really usable for what everyone wants.

The compact-serialized format was introduced as part of the AssumeUTXO project (see #16899) for creating and loading UTXO snapshots. Not sure if at this point changing the specification is desired/needed, I guess for this specific use-case the format is just fine (maybe @jamesob can comment on this?). So it seems that we really have two different goals here: on the one hand a snapshot format to be saved/loaded by bitcoind or even other clients (and that in the most space-efficient way possible), on the other hand providing a neutral easy-to-process dump-format that can be inspected by users. I agree that it would be the best solution if both of these goals could be met by a single format.

dunxen commented at 4:55 pm on May 11, 2022: contributor

There’s an issue in the latest push with configuring with --disable-wallet --with-sqlite=yes with the build system. I’ll take a look in the morning. USE_SQLITE comes out true as expected, though. Maybe something that the wallet flag is guarding.

rpc: Add sqlite format option for dumptxoutset

This allows RPC users of the `dumptxoutset` command to output to a SQLite DB format with human-readable fields.
This approach allows for trivial export to JSON and CSV by using `sqlite3` (detailed in #24628).

There is a new optional 'format' parameter for `dumptxoutset` that can take on the values of either 'compact' or 'sqlite'
with 'compact' as the default. This allows for backward compatibility.

Closes #24628

1db3a52c5b

doc: Add dumptxoutset format parameter to release notes c5506ac298

dunxen force-pushed on May 17, 2022

DrahtBot added the label Needs rebase on May 24, 2022

DrahtBot commented at 1:39 pm on May 24, 2022: contributor

🐙 This pull request conflicts with the target branch and needs rebase.

Want to unsubscribe from rebase notifications on this pull request? Just convert this pull request to a “draft”.

laanwj commented at 2:08 pm on May 31, 2022: member

. I agree that it would be the best solution if both of these goals could be met by a single format.

This could be argued the other way around too: maybe it’s good to have an internal and an external format. The internal format could be changed more easily when necessary for our internal purposes. It’s like the block index, the utxo index, etc. Someone can write a perser but there are no guarantees of compatibility. Also, it only needs the data we need to reconstruct from a snapshot, nothing more, nothing less.

However, changes in the external format, a documented API, have to take external clients into account.

jamesob commented at 5:20 pm on June 6, 2022: member

Alternatively, the existence of the tool may make the PR obsolete. If it’s possible for Core to just maintain a single, compact, dependency-less output format, and it’s easy for devs or power users to write & use tools that can convert that compact format into whichever other formats they need, that seems superior to introducing more code (to maintain) into Core, and a new non-wallet sqlite dependency.

So it seems that we really have two different goals here: on the one hand a snapshot format to be saved/loaded by bitcoind or even other clients (and that in the most space-efficient way possible), on the other hand providing a neutral easy-to-process dump-format that can be inspected by users. @fanquake @theStack yeah, the existing format is intended to be as space efficient as possible since, ideally, that output will be commonly shared across network to help get nodes going. That consideration, in conjunction with

external tools needing libsecp to render the compressed data, and
the obligation that would be introduced by making a previously opaque data blob “publicly” consumable

makes me sympathetic to this changeset. I think as long as this feature gracefully degrades when sqlite support isn’t compiled in, I’m concept ACK and will be happy to review this soon.

laanwj commented at 4:52 pm on June 22, 2022: member

Needs a trivial rebase for #24410:

0SnapshotMetadata metadata{tip->GetBlockHash(), maybe_stats->coins_count, tip->nChainTx};

Edit: i get the following error locally while running the rpc_dumptxoutset.py test:

 0 test  2022-06-22T17:09:26.669000Z TestFramework (ERROR): Assertion failed
 1                                   Traceback (most recent call last):
 2                                     File "/store/user/projects/bitcoin/bitcoin/test/functional/test_framework/util.py", line 134, in try_rpc
 3                                       fun(*args, **kwds)
 4                                     File "/home/user/projects/bitcoin/bitcoin/test/functional/rpc_dumptxoutset.py", line 95, in <lambda>
 5                                       assert_raises_rpc_error(-32603, 'Unable to open database file', lambda: node.dumptxoutset('bad_path/' + FILENAME, 'sqlite'))
 6                                     File "/store/user/projects/bitcoin/bitcoin/test/functional/test_framework/coverage.py", line 49, in __call__
 7                                       return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
 8                                     File "/store/user/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 144, in __call__
 9                                       raise JSONRPCException(response['error'], status)
10                                   test_framework.authproxy.JSONRPCException: Couldn't open file /tmp/test_runner_₿_🏃_20220622_190924/rpc_dumptxoutset_0/node0/regtest/bad_path/txoutset.db.incomplete for writing. (-8)

Looks like it expects error code -32603 but gets -8.

fanquake commented at 8:07 pm on June 22, 2022: member

Concept NACK - I hope we don’t go down this route. We’ve spent years trying to reduce the number of external dependencies in Bitcoin Core; especially in the option-less build. At this point in the project, I think adding new, or expanding the scope of, dependencies in this code base, should be reserved for things that are actually important/mission critical, i.e libsecp256k1, not for nice-to-have features, that are going to be used by a (relatively) small number of developers and enthusiasts, and not actually by the majority of nodes running this peice of software.

If anything, Bitcoin Core should be removing more code, and creating external tooling for new features, rather than continuing to accumulate functionality in this codebase because it’s convenient. In this case, an external tool even exists already (I don’t really understand why it needing libsecp as a dependency is a problem). My longer term conern is that if something like this is merged, in 3 months time, we’ll have a new pull-request, to add support for using sqlite to dump out logs in json format, or some other thing, that clearly shouldn’t be a concern/maintenance burnden of Bitcoin Core.

This PR also gives us somewhat awkward build logic, where an RPC feature is enabled if a user compiles with a certain dependency, that previously, only controlled whether or not you got a certain type of wallet.

How is this going to work with #24202? Are we abandoning that PR in favour of this, or will we end up with a trifecta of output formats?

In any case, this PR (rebased) is currently broken on arm64 macOS, when building with --disable-wallet --with-sqlite. sqlite is available, and building master with the same options works as expected:

0  CXXLD    bitcoin-cli
1  CXXLD    minisketch/test
2  CXXLD    bitcoin-tx
3  CXXLD    bitcoin-util
4ar: -L/opt/homebrew/Cellar/sqlite/3.38.5/lib: No such file or directory
5ar: -lsqlite3: No such file or directory
6gmake[2]: *** [Makefile:5897: libbitcoin_node.a] Error 1

theStack commented at 10:08 pm on June 22, 2022: contributor

@fanquake: Great points raised that should be considered, I agree that an additional external dependency definitely has its drawbacks. Still, the somewhat uncomfortable fact is that right now we don’t have a proper “dumb” (sic!) dump format for the UTXO set that only contains pure information and thus can easily be searched/processed. I think there should be one, as the current compact-serialized format has a different purpose and won’t change, and considering that the UTXO set is one of the (if not the) most important data-structures on all nodes (be it pruned or non-pruned), as it reflects the current state / distribution of funds with all (visible) spending conditions, I would argue it’s worth to make an exception here to the “we don’t add convenience features if the same result can also be achieved somehow with external tools” rule. This enables the user to easily inspect that state without the need for further fiddling around between different formats and investing extra time and disk space.

How is this going to work with #24202? Are we abandoning that PR in favour of this, or will we end up with a trifecta of output formats?

I think it’s pretty obvious that this PR would obsolete #24202, at least I can’t come up with a good reason why anyone would still be interested in a text format if you can already have a superset in the form of a simple but still powerful database (converting from SQLite to CSV format is not a big deal)? Though, if this PR is too controversial, it’s probably still better to have #24202 instead of no proper UTXO set dump format at all. A compromise could be #24202 plus as follow-up a trivial tool (e.g. in contrib/ or external) that converts the text output to a SQLite database? (Would maybe still be faster than the current approach of emiting a compact-serialized dump + converting that to SQLite format via utxo_to_sqlite). Very curious to hear others opinions.

laanwj commented at 8:06 am on June 23, 2022: member

How is this going to work with #24202? Are we abandoning that PR in favour of this, or will we end up with a trifecta of output formats?

I think we should settle on two formats. The format for internal use and a format for external use. From there on, it’s for people with custom scripts to analyze it or convert it into anything they want (it’s not like end users have a use for an UTXO dump anyway). As external format I think sqlite is a surprisingly good choice for these kind of huge data dumps (better than say, JSON, or formats that require writing custom parsers). It should still be possible to compile the node without sqlite though for people (most) who don’t need external UTXO exports.

Though, if this PR is too controversial, it’s probably still better to have #24202 instead of no proper UTXO set dump format at all. A compromise could be #24202 plus as follow-up a trivial tool (e.g. in contrib/ or external) that converts the text output to a SQLite database? (Would maybe still be faster than the current approach of emiting a compact-serialized dump + converting that to SQLite format via utxo_to_sqlite). Very curious to hear others opinions.

I agree.

jamesob commented at 3:56 pm on June 23, 2022: member

I think answers to these questions would be illuminating for comparison to #24202:

is there a considerable difference in final output size for CSV vs. sqlite? is one more space efficient than the other?
what is the code complexity and runtime of, say, a Python script that writes streaming CSV input into sqlite?

If space is comparable between the two options, and a script that adapts CSV -> sqlite is easy (I think it would be) and relatively quick (not sure here), I would say that #24202 is preferable, since it would ship with all builds. We could even bundle such a script in contrib/.

That said, I think dependency considerations aside, sqlite is probably a much more practical output format for actual consumption. Reading sqlite into, say, pandas for further analysis is as trivial CSV - in fact, based on the function signatures of pandas.read_csv vs. pandas.read_sql_table, sqlite is probably the easier choice.

theStack commented at 5:47 pm on June 23, 2022: contributor

is there a considerable difference in final output size for CSV vs. sqlite? is one more space efficient than the other?

Did a quick comparison between a SQLite output that I created with this PR ~3 weeks ago (82401330 UTXO entries) and a text output with #24202 that I created just a few minutes ago (82801821 UTXO entries):

0$ ls -l ~/.bitcoin/utxos.sqlite ~/.bitcoin/utxo_set.txt
1[...] 11009749503 Jun 23 19:20 /home/honeybadger/.bitcoin/utxo_set.txt
2[...] 12445270016 Jun  3 15:44 /home/honeybadger/.bitcoin/utxos.sqlite

Both seem to take roughly ~2x of the size of the compact-serialized format, with the text output even being a bit smaller (~10.25 GB vs ~11.59 GB). The vast majority of the space per entry is taken by txid and scriptPubKey, and since we store them as hex-encoded strings (both in this PR and in #24202), it makes sense that the sizes are in a similar area.

what is the code complexity and runtime of, say, a Python script that writes streaming CSV input into sqlite?

Pretty sure that the code complexity is low (seems like sqlite3 can even directly import CSV files and it could be a very simple shell-script? https://stackoverflow.com/a/1045961), while the runtime could be a bit of a show-stopper here. Will try to write such a script within the next days and report the results then.

JeremyRubin commented at 6:24 pm on June 23, 2022: contributor

Unfortunately, there is no built-in “reverse” operator in SQLite :/ If there was, we could simply do something like the following for the conversion script: SELECT hex(reverse(txid)), vout, value, coinbase, height, hex(scriptpubkey) FROM utxos

Actually, you can use the SUBSTR and LENGTH https://www.sqlite.org/lang_corefunc.html#substr

The substr(X,Y,Z) function returns a substring of input string X that begins with the Y-th character and which is Z characters long. If Z is omitted then substr(X,Y) returns all characters through the end of the string X beginning with the Y-th. The left-most character of X is number 1. If Y is negative then the first character of the substring is found by counting from the right rather than the left. If Z is negative then the abs(Z) characters preceding the Y-th character are returned. If X is a string then characters indices refer to actual UTF-8 characters. If X is a BLOB then the indices refer to bytes.

so in theory:

0SELECT hex(SUBSTR(txid, -1, -LENGTH(txid))), vout, value, coinbase, height, hex(scriptpubkey) FROM utxos

works.

In general I think dumping to sqlite sounds like a fine idea. One point of contention though is it would be nice if we could return the SQLITE file over the network somehow, since not all RPC users have local filesystem access.

theStack commented at 11:54 am on June 24, 2022: contributor

Unfortunately, there is no built-in “reverse” operator in SQLite :/ If there was, we could simply do something like the following for the conversion script: SELECT hex(reverse(txid)), vout, value, coinbase, height, hex(scriptpubkey) FROM utxos

Actually, you can use the SUBSTR and LENGTH https://www.sqlite.org/lang_corefunc.html#substr

[…]

so in theory:
0SELECT hex(SUBSTR(txid, -1, -LENGTH(txid))), vout, value, coinbase, height, hex(scriptpubkey) FROM utxos
works.

Unfortunately that doesn’t work. The substring is still extracted from left to right without any reversing operation involved, the only thing possible is to specify the start and end offsets by counting from right to left:

0$ sqlite3
1SQLite version 3.38.2 2022-03-26 13:51:10
2Enter ".help" for usage hints.
3Connected to a transient in-memory database.
4Use ".open FILENAME" to reopen on a persistent database.
5sqlite> select hex(substr(x'11223344', -1, -4));
6112233

If reversing in SQLite was possible, it would be very tempting to store the txid and scriptPubkey as BLOBs to have more compact dumps that are created faster.

JeremyRubin commented at 5:55 pm on June 24, 2022: contributor

ah interesting… since we know TXIDs are 32 bytes, one option would be to do a 32 byte expansion of something like:

substr(‘abc’,3, 1) || substr(‘abc’, 2, 1) || substr(‘abc’, 1,1)

as a generated virtual table. this would have no storage overhead, but would have some overhead on retrieval.

achow101 commented at 7:14 pm on October 12, 2022: member

Are you still working on this?

cc @jamesob

dunxen commented at 7:25 pm on October 12, 2022: contributor

Are you still working on this?

cc @jamesob

Will get back to it this week, been busy with other work :)

achow101 commented at 5:15 pm on November 10, 2022: member

Will get back to it this week, been busy with other work :)

It’s been several weeks since this comment with no additional updates, so I’m going to close this for now. If you do get around to working on this, please leave a comment so that it can be reopened.

achow101 closed this on Nov 10, 2022

bitcoin locked this on Nov 10, 2023

rpc: Add sqlite format option for dumptxoutset #24952

Time

Size

Conflicts