contrib: utxo_to_sqlite.py: add option to store txid/spk as BLOBs #32621

pull theStack wants to merge 2 commits into bitcoin:master from theStack:202505-utxo-to-sqlite_blobs changing 2 files +60 −22
  1. theStack commented at 3:03 am on May 27, 2025: contributor

    This PR is a late follow-up to #27432, introducing an option for the utxo-to-sqlite script to store the txid/scriptPubKey columns as bytes (= BLOB storage class in sqlite, see e.g. https://www.sqlite.org/datatype3.html in sqlite) rather than hex strings. This was proposed in earlier reviews (https://github.com/bitcoin/bitcoin/pull/27432#issuecomment-1516857024, #27432 (comment)) and has the obvious advantage of a significantly smaller size of the resulting database (and with that, faster conversion) and the avoidance of hex-to-bytes conversion for further processing of the data [1]. The rationale on why hex strings were chosen back then (and still stays the default, if only for compatibility reasons) is laid out in #27432 (comment) [2].

    The approach taken is introducing new parameters --spk and --txid which can either have the values “hex”, “raw” (for scriptpubkey) and “hex”, “raw”, “rawle” (for txid). Thanks to ajtowns for providing this suggestion. Happy to take further inputs on naming and thoughts on future extensibility etc.

    [1] For a concrete example, I found that having these columns as bytes would be nice while working on a SwiftSync hints generator tool (https://github.com/theStack/swiftsync-hints-gen), which takes the result of the utxo-to-sqlite tool as input. [2] note that in contrast what I wrote back then, I think there is no ambiguity on byte-string-serialization of txids; they are ultimately just hash results and hence, they should be stored as such, and adding a big/little endian knob wouldn’t make much sense. The drawback of not being able to immediately show txid-strings (as one would need to do the bytes-reversal step first, which is not possible in sqlite, see e.g. #24952 (comment)) still remains though.

  2. DrahtBot commented at 3:03 am on May 27, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32621.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK w0xlt

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #32116 (contrib: refactor: dedup deserialization routines in utxo-to-sqlite script by theStack)
    • #31560 (rpc: allow writing UTXO set to a named pipe, introduce dump_to_sqlite.sh script by theStack)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

    LLM Linter (✨ experimental)

    Possible typos and grammar issues:

    • if –spk=raw, then scriptpubkey will be BLOB instead. -> If –spk=raw, then scriptpubkey will be BLOB instead. [sentence start should be capitalized]

    No other typos were found.

    drahtbot_id_4_m

  3. DrahtBot added the label Scripts and tools on May 27, 2025
  4. ajtowns commented at 4:40 am on May 27, 2025: contributor

    The approach taken is introducing a -m/--mode parameter which can either have the values “hex” (default) or “bytes”. Happy to take suggestions on naming and thoughts on future extensibility etc.

    I wonder if separate options would be better, eg --spk=raw --txid=raw?

    I think there is no ambiguity on byte-string-serialization of txids; they are ultimately just hash results and hence, they should be stored as such, and adding a big/little endian knob wouldn’t make much sense.

    I think it will be surprising if hex(txid) gives “reversed” values so that just copy and pasting into block explorers doesn’t work. If you did separate flags, you could consider --txid=hex, --txid=raw (same byte order as hex), and --txid=rawhash (same byte order as sha256 output) as separate options, maybe?

    Here’s a patch: https://github.com/ajtowns/bitcoin/commit/d1fc5dff9afc2c05aa975fa7b29c029865d77507

    Could consider allowing --spk=none --txid=none to drop both fields for further space saving if you’re just analysing the utxo set, not doing things with it (or --txid=counter to replace the actual txid with a counter that saves space but still lets you notice when utxos are from the same tx).

    The drawback of not being able to immediately show txid-strings (as one would need to do the bytes-reversal step first, which is not possible in sqlite, see e.g. #24952 (comment)) still remains though.

    Seems like you can get addons for this, for whatever that’s worth.

  5. theStack force-pushed on May 27, 2025
  6. theStack commented at 2:00 pm on May 27, 2025: contributor

    @ajtowns: Thanks for your suggestions, I like the idea of having more flexibility and mostly agree with the naming. Took your patch (TIL about argparse choices) and adapted the functional test accordingly, to exercise all combinations of new options. As for txid format option naming, I think I’d intuitively prefer raw for storing them in the hash order, since that’s AFAICT the most common txid byte-string serialization (I think I haven’t encountered any counter-examples so far in the Bitcoin ecosystem) – but then one would need to pick a good name for the “reverse-hash” byte-string serialization again. So maybe it’s just fine :man_shrugging: Personally, with the current naming, I’d only either wanted to use the combinations --txid=hex --spk=hex (i.e. the default, the only possible on master), or --txid=rawhash --spk=raw.

    Could consider allowing –spk=none –txid=none to drop both fields for further space saving if you’re just analysing the utxo set, not doing things with it (or –txid=counter to replace the actual txid with a counter that saves space but still lets you notice when utxos are from the same tx).

    Pretty neat ideas, and potentially useful for e.g. dust (and other “spam”) or output script type analysis. Will consider for further follow-ups.

  7. theStack force-pushed on May 27, 2025
  8. theStack force-pushed on May 27, 2025
  9. DrahtBot added the label CI failed on May 27, 2025
  10. DrahtBot commented at 2:06 pm on May 27, 2025: contributor

    🚧 At least one of the CI tasks failed. Task lint: https://github.com/bitcoin/bitcoin/runs/42967430443 LLM reason (✨ experimental): The CI failure is caused by a linting error due to a Python f-string missing placeholders.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  11. DrahtBot removed the label CI failed on May 27, 2025
  12. ajtowns commented at 11:15 pm on May 27, 2025: contributor

    I think I’d intuitively prefer raw for storing them in the hash order, since that’s AFAICT the most common txid byte-string serialization

    Maybe raw and rawle then? For future consideration --spk=type so that it only records whether the spk was p2pk, p2pkh, p2sh, p2wpkh, p2wsh, p2tr, p2a, or other would probably be interesting.

  13. theStack force-pushed on May 28, 2025
  14. theStack commented at 5:14 pm on May 28, 2025: contributor

    I think I’d intuitively prefer raw for storing them in the hash order, since that’s AFAICT the most common txid byte-string serialization

    Maybe raw and rawle then?

    raw/rawle are a decent alternative that I’d slightly prefer over rawhash/raw, yes! Force-pushed accordingly. I’m fine with both though, so if other reviewers / potential users of that script feel strongly on rawhash/raw, I’m happy to revert.

    For future consideration --spk=type so that it only records whether the spk was p2pk, p2pkh, p2sh, p2wpkh, p2wsh, p2tr, p2a, or other would probably be interesting.

    Concept ACK. I think there is room for some bike-shedding on details, like e.g. do we split up p2pk into uncompressed and compressed (or, god forbid, even hybrid?) or whether bare multisig should also be a category (if at all, then probably only with a very weak detection like “first byte is in range OP_1…OP_16, last byte is OP_CHECKMULTISIG”, as implementing the full detection as used in Solver seems overblown for a conversion script). But, something to discuss in another PR anyways.

  15. in contrib/utxo-tools/utxo_to_sqlite.py:119 in 1b10e882f3 outdated
    113@@ -111,7 +114,9 @@ def main():
    114     parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    115     parser.add_argument('infile', help='filename of compact-serialized UTXO set (input)')
    116     parser.add_argument('outfile', help='filename of created SQLite3 database (output)')
    117-    parser.add_argument('-v', '--verbose', action='store_true', help='show details about each UTXO')
    118+    parser.add_argument('--verbose', action='store_true', help='show details about each UTXO')
    119+    parser.add_argument('--spk', choices=['hex', 'raw'], default='hex', help='encode scriptPubKey as hex or raw bytes')
    120+    parser.add_argument('--txid', choices=['hex', 'raw', 'rawle'], default='hex', help='encode txid as hex, raw bytes (sha256 byteorder), or reversed raw bytes')
    


    w0xlt commented at 5:45 pm on May 28, 2025:

    Does rawle mean raw little-endian? If so, it would be better to be explicit in the description.

    0    parser.add_argument('--txid', choices=['hex', 'raw', 'rawle'], default='hex', help='encode txid as hex, raw bytes (sha256 byteorder), or reversed raw bytes (litle-endian)')
    

    Although rev_raw or something similar would also work fine.


    theStack commented at 0:56 am on May 29, 2025:
    Agree, done.
  16. contrib: utxo_to_sqlite.py: add options to store txid/spk as BLOBs
    Co-authored-by: Anthony Towns <aj@erisian.com.au>
    b30fca7498
  17. test: run utxo-to-sqlite script test with spk/txid format option combinations 7378f27b4f
  18. theStack force-pushed on May 29, 2025

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-06-14 15:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me