Fix ASM ambiguity #28824

pull willcl-ark wants to merge 7 commits into bitcoin:master from willcl-ark:asm-full-hex changing 34 files +360 −227
  1. willcl-ark commented at 4:07 pm on November 8, 2023: member

    Closes: #27795 Closes: #7996

    Previously ScriptToAsmStr returned hex-encoded values, except if data length was <= 4 bytes, in which case it displayed using decimal encoding.

    Fix the ASM representation by specifying an unambiguous encoding:

    • Drop OP_ prefix from all opcodes
    • For non-minimal pushes prefix with the opcode and enclose pushed hex value in angle brackets PUSHDATA1<0100>
    • For minimal pushes:
      • If > 5 bytes display pushed hex value enclosed in angle brackets <...>
      • If <= 5 bytes:
        • If minimally-encoded display pushed value in decimal without angle brackets 42
        • If not minimally-encoded display pushed hex value enclosed in angle brackets <4200>
      • Display undecodable bytes using UNDECODABLE(...)

    This should permit unambiguous decoding in the future.

  2. DrahtBot commented at 4:07 pm on November 8, 2023: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/28824.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #30408 (rpc: doc: use “output script” terminology consistently in “asm”/“hex” results by theStack)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. luke-jr referenced this in commit 08ab70da3c on Nov 9, 2023
  4. in test/functional/rpc_decodescript.py:136 in fde11cb0fa outdated
    132@@ -133,7 +133,7 @@ def decodescript_script_pub_key(self):
    133         cltv_script = '63' + push_public_key + 'ad670320a107b17568' + push_public_key + 'ac'
    134         rpc_result = self.nodes[0].decodescript(cltv_script)
    135         assert_equal('nonstandard', rpc_result['type'])
    136-        assert_equal('OP_IF ' + public_key + ' OP_CHECKSIGVERIFY OP_ELSE 500000 OP_CHECKLOCKTIMEVERIFY OP_DROP OP_ENDIF ' + public_key + ' OP_CHECKSIG', rpc_result['asm'])
    137+        assert_equal('OP_IF ' + public_key + ' OP_CHECKSIGVERIFY OP_ELSE 20a107 OP_CHECKLOCKTIMEVERIFY OP_DROP OP_ENDIF ' + public_key + ' OP_CHECKSIG', rpc_result['asm'])
    


    ajtowns commented at 4:12 am on November 9, 2023:
    This seems much harder to understand to me – you need to convert it to 0x07a120 then convert that to decimal to get back to the nice round number block height? In particular mistaking it for 0x20a107 (2,138,375) would be particularly misleading.
  5. in test/functional/wallet_send.py:173 in fde11cb0fa outdated
    169@@ -170,7 +170,7 @@ def test_send(self, from_wallet, to_wallet=None, amount=None, data=None,
    170                 else:
    171                     assert_greater_than(from_balance_before - from_balance, amount)
    172             else:
    173-                assert next((out for out in tx["vout"] if out["scriptPubKey"]["asm"] == "OP_RETURN 35"), None)
    174+                assert next((out for out in tx["vout"] if out["scriptPubKey"]["asm"] == "OP_RETURN 23"), None)
    


    ajtowns commented at 4:14 am on November 9, 2023:
    This also seems confusing, particularly if it appears in something like DUP SIZE 20 EQUALVERIFY HASH160 <x> EQUAL – whoops, that’s actually saying the input should be 32 bytes, not 20 bytes.
  6. MatthewLM commented at 9:32 am on November 9, 2023: none

    In the dart library coinlib (https://pub.dev/packages/coinlib) I followed the same approach of encoding everything in HEX with integers encoded as LE. However -1 is encoded as -1. See the code here: https://github.com/peercoin/coinlib/blob/master/coinlib/lib/src/scripts/operations.dart

    I did this to avoid changing the ASM too much. However, I’m happy to move towards using 0x for hex and presenting decimal for smaller data of up-to 32bits if this is introduced in Bitcoin. This would make the ASM easy to understand and read.

  7. willcl-ark commented at 4:25 pm on November 9, 2023: member

    @ajtowns @MatthewLM thank for your comments.

    I agree that after these changes some ASM fields will be less human-readable-friendly in LE hex.

    Prefixing with 0d was another possibility discussed previously, but there still exists a chance that someone will interpret these as hex-encoded values, as d is a valid hex char (whereas x is clearly not decimal or hex). e.g this is not necessarily that much clearer, but does offer some improvement:

    0$ bitcoin-cli -regtest decodescript 0511121314150457c74942 | jq .asm
    1"1112131415 0d1112131415"
    2
    3$ src/bitcoin-cli -regtest decodescript 0512345678900320A107b1 | jq .asm
    4"1234567890 0d500000 OP_CHECKLOCKTIMEVERIFY"
    

    Thinking about this more, it feels like the question to be answered is perhaps “is the asm field supposed to be human or machine readable?” If the former, then maintaining decimal for small values, and adding a hex-prefix for > 4 byte values might be most desirable, even it it leaves some ambiguity in some circumstances?

    0$ src/bitcoin-cli -regtest decodescript 0512345678900320A107b1 | jq .asm
    1"0x1234567890 500000 OP_CHECKLOCKTIMEVERIFY"
    

    This would create a moderately inconvenient interface in a few of the tests (and possibly some downstream users relying on this output) where hex prefixes, if present, have to be stripped from rpc output, but perhaps that’s not too terrible.

    If we are designing the asm output to be a canonical, machine-readable format then my opinion remains that using hex-only makes the most sense.

    FWIW btcdeb interprets all values as decimal but prints a warning that they are ambiguous:

    0$ btcc 1234567890 500000 OP_CHECKLOCKTIMEVERIFY
    1warning: ambiguous input 1234567890 is interpreted as a numeric value; use 0x1234567890 to force into hexadecimal interpretation
    2warning: ambiguous input 500000 is interpreted as a numeric value; use 0x500000 to force into hexadecimal interpretation
    304d20296490320a107b1
    4
    5$ btcc 0x1234567890 500000 OP_CHECKLOCKTIMEVERIFY
    6warning: ambiguous input 500000 is interpreted as a numeric value; use 0x500000 to force into hexadecimal interpretation
    70512345678900320a107b1
    

    And consequently generates a different script without manually specifying hex. I don’t know what other tools do by default.

  8. MatthewLM commented at 4:31 pm on November 9, 2023: none
    I agree that 0d is not a good idea, much better to go with 0x for hex.
  9. ajtowns commented at 8:39 pm on November 9, 2023: contributor

    Thinking about this more, it feels like the question to be answered is perhaps “is the asm field supposed to be human or machine readable?”

    I think it’s human readable? For machines we could just give the raw hex script and not bother with an asm format.

    Anyway, how about going for something more left field and having decodescript 0511121314150457c74942 output [1112131415] 1112131415 where the square brackets indicated “quoted hexidecimal”, and values without square brackets are always decimal numbers within CScriptNum range? That would still be pretty clear for pubkeys:

    02 [027d28a3580d0823dbee34f1df927ee58664eb8349e79b94017ddaf3f7e159153f] [03a3af0f49a21d29106ebeef9b3c3d69fc375c856a7153d97156a5b7a161ca6c25] 2 OP_CHECKMULTISIG
    

    Could perhaps also represent non-minimal pushes that way, eg the different encodings of -1 or [ff], something like:

    0 decodescript 4f -> -1
    1 decodescript 0181 -> [0:81]
    2 decodescript 4c0181 -> [1:81]
    3 decodescript 4d010081 -> [2:81]
    4 decodescript 4e0100000081 -> [4:81]
    
  10. willcl-ark commented at 12:57 pm on November 13, 2023: member

    I think it’s human readable? For machines we could just give the raw hex script and not bother with an asm format.

    This has always been my interpretation too.

    Anyway, how about going for something more left field…

    Square brackets is something I could get behind; to me it is clear from your example that the brackets are indicating something, and just by looking at them I can see in most cases that they’re hex values. To use the previous (less common?) example formatted with the brackets:

    0$ bitcoin-cli -regtest decodescript 0511121314150457c74942 | jq .asm
    1"[1112131415] 1112131415"
    

    IMO it’s again clear to me now that there’s something different about these two values, and if I didn’t know what it was then I should find out.

    One question I have is whether this will interfere with the scripthash types too badly? e.g. Is this nested bracket undesirable?

    0$ src/bitcoin-cli -regtest decoderawtransaction 0100000001696a20784a2c70143f634e95227dbdfdf0ecd51647052e70854512235f5986ca010000008a47304402207174775824bec6c2700023309a168231ec80b82c6069282f5133e6f11cbb04460220570edc55c7c5da2ca687ebd0372d3546ebc3f810516a002350cac72dfe192dfb014104d3f898e6487787910a690410b7a917ef198905c27fb9d3b0a42da12aceae0544fc7088d239d9a48f2828a15a09e84043001f27cc80d162cb95404e1210161536ffffffff0100e1f505000000001976a914eb6c6e0cdb2d256a32d97b8df1fc75d1920d9bca88ac00000000 | jq .vin[].scriptSig.asm
    1"[304402207174775824bec6c2700023309a168231ec80b82c6069282f5133e6f11cbb04460220570edc55c7c5da2ca687ebd0372d3546ebc3f810516a002350cac72dfe192dfb[ALL]] [04d3f898e6487787910a690410b7a917ef198905c27fb9d3b0a42da12aceae0544fc7088d239d9a48f2828a15a09e84043001f27cc80d162cb95404e1210161536]"
    

    specifically:

    0[... 2dfb[ALL]]
    

    It might be preferable to have this format instead?

    0[... 2dfb][ALL]
    
  11. MatthewLM commented at 1:37 pm on November 13, 2023: none
    0x is much more standard and recognisable than [].
  12. ajtowns commented at 11:09 pm on November 13, 2023: contributor

    0x is much more standard and recognisable than [].

    It’s pretty misleading though, since CScriptNum is little-endian: if you see 0x0100 you would expect that to be 256, but for us it would actually be non-minimally-encoded 1; For us, 0x0201 would actually be less than 0x0102, etc.

    Note this applies to block hashes, as well, so if you provide block 816651’s header and execute HASH256, you end up with [7d6368795d9675a198bd1dbe3d87dbdbbc3e0f3560a202000000000000000000]; if you were to write 0x00000000000000000002a260350f3ebcdbdb873dbe1dbd98a175965d7968637d to refer to the block hash, the same way we do in kernel/chainparams.cpp, you’d end up with the wrong thing. But you can’t have “0x” just reverse things either, as that would mess up 33-byte pubkeys (they’d now end with 02 or 03 instead of beginning with it) and signatures.

  13. ajtowns commented at 11:34 pm on November 13, 2023: contributor

    One question I have is whether this will interfere with the scripthash types too badly? e.g. Is this nested bracket undesirable?

    specifically:

    0[... 2dfb[ALL]]
    

    It might be preferable to have this format instead?

    0[... 2dfb][ALL]
    

    I guess it being inside the brackets seems slightly clearer to me? Could have different brackets for the two things, eg [... 2dfb<ALL>]. Could perhaps also drop the feature entirely (reverting #5264); note that we don’t display sighash flags for taproot signatures.

  14. MatthewLM commented at 11:05 am on November 14, 2023: none

    It’s pretty misleading though, since CScriptNum is little-endian: if you see 0x0100 you would expect that to be 256, but for us it would actually be non-minimally-encoded 1; For us, 0x0201 would actually be less than 0x0102, etc.

    Those smaller numbers will be shown in decimal still. For larger pushdata, hex is already shown in the little-endian encoded byte order. If 0x is too suggestive of a big-endian hex number (a fair point), then maybe bytes(123456789abcdef) is clearer and less ambiguous (maybe use push instead of bytes). If following that scheme, signatures could be something like sig(...2dfb, ALL). Possibly public keys could be pk(...).

  15. sipa commented at 1:28 pm on November 14, 2023: member

    Do we actually want to resolve all forms of ambiguity? E.g. what about the distinction between an OP_1 vs. a direct push of 0x01, vs. OP_PUSHDATA1 of 0x01? I think it’d be nice if there was a one-to-one correspondence between the disassembly and the actual script bytes.

    My contribution to this bikeshedding would be:

    • OP_n -> n (so -1, 0, 1, …, 10, …, 16).
    • Drop “OP_” prefix for other opcodes too; they’re redundant.
    • Other minimal pushes (so direct pushes, as well as OP_PUSHDATA1 over 75 bytes, and OP_PUSHDATA2 over 255 bytes):
      • If the data pushed is a minimally-encoded positive number between 0 and 2^39-1 (OP_CLTV / OP_CSV accept 5-byte numbers) -> <0x12345678> (BE hex, as that’s what a human expects for a numeric value with 0x).
      • Otherwise, <1234567890ABCDEF> (in wire byte order, no 0x because not attempting to convey a number)
    • For non-minimal pushes, use PUSHDATA1<...> or PUSHDATA2<...> (with the contents of ... using the same rules as for minimal pushes).
    • Drop the sighash byte type decoding. I don’t feel it matches what asm is trying to do.
    • Optionally, for P2SH(-like) scriptSigs, the final redeemScript push could be recursively decoded (so inside <...>, put the asm decoding of the script being pushed), while remaining fully unambiguous.
  16. ajtowns commented at 3:50 am on November 15, 2023: contributor

    My contribution to this bikeshedding would be: * OP_n -> n (so -1, 0, 1, …, 10, …, 16). * Drop “OP_” prefix for other opcodes too; they’re redundant. * Drop the sighash byte type decoding. I don’t feel it matches what asm is trying to do.

    :+1:

    • For non-minimal pushes, use PUSHDATA1<...> or PUSHDATA2<...> (with the contents of ... using the same rules as for minimal pushes).

    Also PUSHDATA4<...> for completeness, presumably. A direct push of values -1 and 1..16 is also non-minimal (ie, 0101 instead of 51), could perhaps use PUSHDATA<..> (without 1, 2 or 4) for that case.

    • Other minimal pushes (so direct pushes, as well as OP_PUSHDATA1 over 75 bytes, and OP_PUSHDATA2 over 255 bytes):
    • If the data pushed is a minimally-encoded positive number between 0 and 2^39-1 (OP_CLTV / OP_CSV accept 5-byte numbers) -> <0x12345678> (BE hex, as that’s what a human expects for a numeric value with 0x).

    Provided you’ve already dealt with non-minimally encoded -1 and 1..16, using decimals seems nicer here? You can cope with negatives cleanly (-2 vs <82>), and it’s one less encoding variation?

    • Optionally, for P2SH(-like) scriptSigs, the final redeemScript push could be recursively decoded (so inside <...>, put the asm decoding of the script being pushed), while remaining fully unambiguous.

    That would be nice. Probably requires remembering info from the scriptPubKey and using it when decoding the scriptSig or witness.

  17. MatthewLM commented at 9:29 am on November 15, 2023: none

    Provided you’ve already dealt with non-minimally encoded -1 and 1..16, using decimals seems nicer here? You can cope with negatives cleanly (-2 vs <82>), and it’s one less encoding variation?

    I was thinking it would be nice to retain decimal but, due to ambiguity that could arise with CLTV and CSV, I’m not sure.

    Having embedded ASM for redeem scripts and Tapscripts would be very useful.

  18. ajtowns commented at 10:35 am on November 15, 2023: contributor

    I was thinking it would be nice to retain decimal but, due to ambiguity that could arise with CLTV and CSV, I’m not sure.

    What ambiguity?

  19. MatthewLM commented at 12:31 pm on November 15, 2023: none
    For 32-bit signed integers, the 32nd leftmost bit is negative, but that’s not true for CLTV and CSV which uses 5 bytes. So a push data could represent a negative 32-bit signed integer, or a positive 40-bit signed integer if I understand the CScriptNum without looking more closely at it.
  20. sipa commented at 12:33 pm on November 15, 2023: member
    @MatthewLM I don’t think that’s correct. What’s special in CSV and CLTV is that they permit numbers whose encoding is up to 5 bytes (rather than the math opcodes which only support encodings up to 4 bytes on input). But the encoding algorithm is the same: the top bit is the sign.
  21. MatthewLM commented at 12:36 pm on November 15, 2023: none
    @sipa I’ll have to look at the code. I imagine it is two’s complement? Then the negative bit will be different for each type of integer meaning that a 4 byte pushdata that would encode a negative 32-bit signed integer could also be interpreted as a positive 40-bit signed integer.
  22. sipa commented at 12:47 pm on November 15, 2023: member

    @ajtowns Ok, iterating on that, new proposal:

    • Any minimally-encoded number, using a minimal push (OP_n, or direct push for numbers that cannot be encoded using OP_n) of an integer between 231 and 239-1, is just encoded in decimal directly (no <> or anything).
    • OP_ prefix dropped in general
    • Drop sighash type.
    • Any other minimal pushes (so direct push <= 75 bytes, PUSHDATA1 > 75 bytes, PUSHDATA2 > 255 bytes) become <...> with … the hex-encoding (in wire byte order) of the push.
    • Non-minimal pushes become PUSHDATA1<...>, PUSHDATA2<...>, or PUSHDATA4<...>.
    • Inside <...> instead of having hex data, it’s also allowed to put another decoded script (as long the script doesn’t consist of exactly one 2-digit number push, because that’d be ambiguous…).

    There is no need for PUSHDATA<...> here, because a direct push of 3 for example (0103 in hex) would become <03>, while OP_3 would become 3. @MatthewLM I think you’re missing something, but I don’t see what. There is no ambiguity. Every byte encoding represents exactly one number, with a well-defined sign.

  23. MatthewLM commented at 12:56 pm on November 15, 2023: none

    @sipa Sorry, you are right. I looked at the code again. Bitcoin doesn’t use two’s complement for this (I have to keep in mind that Bitcoin has a lot of quirks). It uses a sign bit that negates the other bits, and this sign bit depends on the size of the push data (whatever the last byte is), so there is no ambiguity.

    Therefore I think minimally pushed numbers ought to be displayed as decimal for readability reasons as you now suggest.

    • Inside <...> instead of having hex data, it’s also allowed to put another decoded script (as long the script doesn’t consist of exactly one 2-digit number push, because that’d be ambiguous…).

    Or any other decimal that has an even number of digits. Maybe use something like SCRIPT<...>?

  24. sipa commented at 1:01 pm on November 15, 2023: member
    @MatthewLM Right, any single numeric push of even number of digits inside <> would be ambiguous. But also, scripts consisting of a single push are generally useless as a redeemscript (they’d be anyone-can-spend), so I’m not sure it’s an issue. If you actually encounter one of those, just don’t do recursive decoding.
  25. sipa commented at 1:24 pm on November 15, 2023: member

    One more thought: if the format we end up with is unambiguous, there could also be a function/RPC/tool to convert asm back to script (and we can have fuzz tests that the roundtripping encoding+decoding is exact!). If so, we can probably permit a few things in the decoding that the encoder might not (yet) support:

    • Optionally OP_ prefixing opcodes (and using aliases like TRUE or OP_TRUE for OP_1).
    • Using 0x... for BE hex encoded numbers in all places where a decimal number is supported (so outside <>).
    • PUSHDATA1<...> or PUSHDATA2<...> notation even when the push is minimal.
    • Recursive decoding (if not supported in the encoder in the first iteration).
  26. willcl-ark commented at 1:25 pm on November 15, 2023: member

    I like where this is going now. Thanks for the conversation.

    I’ve written a branch implementing this previous version as I wanted to see what the changes would look like (although I’m not convinced I’m handling OP_1NEGATE properly yet, even though tests are passing…)

    Any minimally-encoded number, using a minimal push (OP_n, or direct push for numbers that cannot be encoded using OP_n) of an integer between 231 and 239-1, is just encoded in decimal directly (no <> or anything).

    I think this will make the representation even nicer to interpret and will gladly make this change.

    Inside <…> instead of having hex data, it’s also allowed to put another decoded script (as long the script doesn’t consist of exactly one 2-digit number push, because that’d be ambiguous…).

    I’ll take a stab at implementing these nested scripts as well, as I think they’d be a nice improvement too…

  27. ajtowns commented at 1:49 pm on November 15, 2023: contributor
    • Any minimally-encoded number, using a minimal push (OP_n, or direct push for numbers that cannot be encoded using OP_n) of an integer between 231 and 239-1, is just encoded in decimal directly (no <> or anything).

    -2**31 -2**31+1 and 2**39-1 presumably? (original lacks the negative sign, and github can’t quote superscript correctly apparently). EDIT: 0xffff_ffff is -2**31+1 with a sign-flag, rather than -2**31 being 0x8000_0000 in two’s complement

    • Any other minimal pushes (so direct push <= 75 bytes, PUSHDATA1 > 75 bytes, PUSHDATA2 > 255 bytes) become <...> with … the hex-encoding (in wire byte order) of the push.

    Per script/script.cpp:CheckMinimalPush(), 0101 isn’t a minimal push, so I think it’d be better to drop the “minimal push” phrasing and just use the definition currently in brackets. (also; PUSHDATA4 > 65535 bytes)

    • Inside <...> instead of having hex data, it’s also allowed to put another decoded script (as long the script doesn’t consist of exactly one 2-digit number push, because that’d be ambiguous…).

    Any push of a single positive number between 10-99; 1,000-9,999; 100,000-999,999; 10,000,000-99,999,999; 1,000,000,000-9,999,999,999; and 100,000,000,000-549,755,813,887 would be ambiguous, no?

  28. sipa commented at 2:00 pm on November 15, 2023: member
    @ajtowns Right on all 3 counts. Instead of limiting the integer range, maybe just “all minimal pushes of up to 5 bytes, which push a minimally-encoded integer”.
  29. sipa commented at 2:05 pm on November 15, 2023: member

    One more question: what to do with unknown opcodes, and undecodable bytes? The current FormatScript just turns it into bare hex with 0x prefix. I guess we could keep the “0x” prefix to denote “raw stuff”, but the current format is pretty misleading (it uses wire byte order, while “0x” suggests they’re BE numbers) and incompatible with allowing “0x…” for denoting actual integers.

    Suggestion: use RAW(...) with “…” the hex in wire byte order for those. I’d prefer not to use RAW<...> because I associate <> with “pushing”, which wouldn’t be correct here. I’m not too happy with having a special keyword in the syntax just for that either, so suggestions welcome.

  30. willcl-ark commented at 2:14 pm on November 15, 2023: member

    Currently GetOpCode() will return OP_UNKNOWN for unknown opcodes which seems reasonable. Dropping the OP_ would appear in the asm as UNKNOWN.

    As for undecodable bytes, if we did want a keyword, perhaps something more explicit like UNDECODABLE<...> would be better? We are essentially using what I would expect RAW to denote for > 5 byte minimal pushes if we go with this:

    Any other minimal pushes (so direct push <= 75 bytes, PUSHDATA1 > 75 bytes, PUSHDATA2 > 255 bytes) become <…> with … the hex-encoding (in wire byte order) of the push.

  31. sipa commented at 2:18 pm on November 15, 2023: member

    Currently GetOpCode() will return OP_UNKNOWN for unknown opcodes which seems reasonable.

    That’s trivially ambiguous, as all unknown opcodes are mapped to the same thing.

    As for undecodable bytes, if we did want a keyword, perhaps something more explicit like UNDECODABLE<…> would be better?

    Maybe, but I wouldn’t use <...> because it’s not a push.

    We are essentially using what I would expect RAW to denote for > 5 byte minimal pushes if we go with this:

    No, any pushes become either decimal numbers, or <...>, or PUSHDATA[124]<...>. Pushes are not undecodable.

    EDIT: actually, I think all undecodable things (ignoring unknown opcodes) are pushes (but ones which straddle the end of the script boundary).

  32. MatthewLM commented at 4:10 pm on November 15, 2023: none
    Could we use UNKNOWN_n where n denotes the op code hex/decimal such as UNKOWN_FE for 0xfe?
  33. willcl-ark marked this as a draft on Nov 15, 2023
  34. willcl-ark force-pushed on Nov 15, 2023
  35. willcl-ark force-pushed on Nov 15, 2023
  36. willcl-ark commented at 10:33 pm on November 15, 2023: member

    Marked as draft for now and pushed what I have worked on so far.

    I haven’t looked at implementing undecodable bytes or nested scripts yet.

  37. in src/core_write.cpp:137 in 89547bb05f outdated
    151-                    }
    152-                    str += HexStr(vch) + strSigHashDecode;
    153+            if (CheckMinimalPush(vch, opcode)) {
    154+                if (vch.size() <= 5 ) {
    155+                    // Return decimal for minimially-encoded values <= 5 bytes (OP_CLTV / OP_CSV accept 5-byte numbers)
    156+                    CScriptNum n(vch, false, 5);
    


    sipa commented at 11:53 pm on November 15, 2023:
    I think you want fRequireMinimal=true true here, otherwise it’ll turn non-minimally-encoded values into decimal too.

    willcl-ark commented at 11:28 am on November 16, 2023:
    Agree that it should be true as a belt-and-suspenders, but we are inside if (CheckMinimalPush(vch, opcode)) already here, so shouldn’t happen?

    ajtowns commented at 11:47 am on November 16, 2023:
    Byte code 024000 is a minimal push of the two-byte vector 4000 (so passes CheckMinimalPush) but it’s not a minimal CScriptNum since it just evaluates to 64, and should have been 0140 (so fails fRequireMinimal due to the trailing 0).

    sipa commented at 12:29 pm on November 16, 2023:

    Yeah, there are two distinct concepts of minimality:

    • Minimal pushes apply to the interpretation of stack elements as byte vectors, and are about whether you used the right opcode (OP_n, direct push, OP_PUSHDATA1, OP_PUSHDATA2, OP_PUSHDATA4): you have to use the first applicable one from that list for a push to be a minimal push. It’s relevant even for non-numeric data in script (e.g. you shouldn’t use OP_PUSHDATA1 to push a public key, as a direct push suffices).
    • Minimally-encoded integers apply to the interpretation of stack elements as numbers, and are about whether no surplus bytes were introduced when encoding the number as a byte vector but not about how that byte vector gets pushed. The byte vectors 42, 4200, 420000 all encode the number 66, and 83, 0380, 030080, … all encode the number -3, but you have to use the first one from each list for it to be minimally-encoded.

    willcl-ark commented at 2:48 pm on November 16, 2023:

    Thanks for the explanations.

    I’ve refactored out handling pushed data, as I wasn’t enjoying all the nested if statements, and notice that we are currently handling “non-minimal pushes” the same as minimal pushes with non-minimally-encoded values, which is to use OPCODE<hex>. This seems ok, and it will not be a lossy decode, but do we want to differentiate between the two pathways to this decode format?

    See first and last case in snippet to illustrate what I mean above:

     0/*
     1 * Format pushed bytes into appropriate ASM format
     2 */
     3std::string FormatPushDataAsm(const std::vector<unsigned char>& vch, opcodetype opcode)
     4{
     5    // Use OPCODE<hex> for non-minimal pushes
     6    // TODO: There are no tests for this currently
     7    if (!CheckMinimalPush(vch, opcode)) {
     8        return GetOpNameAsm(opcode) + BracketStr(HexStr(vch));
     9    }
    10
    11    // Use <hex> for minimal pushes > 5 bytes
    12    if (!(vch.size() <= 5) ) {
    13        return BracketStr(HexStr(vch));
    14    }
    15
    16    // Use decimal for minimally-encoded, minimal pushes <= 5 bytes
    17    // Note: OP_CLTV / OP_CSV accept 5-byte numbers
    18    try {
    19        CScriptNum n{vch, true, 5};
    20        return strprintf("%lld", n.GetInt64());
    21    }
    22
    23    // Use OPCODE<hex> for non-minimally-encoded minimal pushes
    24    // TODO: There are no tests for this currently
    25    catch (scriptnum_error& e) {
    26        return GetOpNameAsm(opcode) + BracketStr(HexStr(vch));
    27    }
    28}
    

    sipa commented at 2:50 pm on November 16, 2023:
    The <...> and OPCODE<...> cases treat the pushed values as byte arrays, not as numbers, so minimal encoding or not is irrelevant.

    willcl-ark commented at 2:54 pm on November 16, 2023:

    Sure, but what I mean is that in my current impl there are two ways we can end up with OPCODE<...>:

    1. Wasn’t a minimal push
    2. Was a minimal push <5 bytes, but wasn’t a minimally-encoded decimal value

    sipa commented at 3:03 pm on November 16, 2023:

    Hmm, right. I think that’s redundant.

    I’d go for something like:

    • If minimal push, not over 5 bytes, and minimally encoded, use decimal.
    • Otherwise, if minimal push, or direct push, use <...>.
    • Otherwise, use OPCODE<...>.

    sipa commented at 3:16 pm on November 16, 2023:

    This is unambiguous, with the following decoding rules:

    • Decimal numbers are turned into OP_n if applicable, otherwise into a direct push of the minimally-encoded form of that number.
    • <...> is turned into a direct push if up to 75 bytes, into OP_PUSHDATA1 if below 256 bytes, into OP_PUSHDATA2 if below 65536 bytes, and into OP_PUSHDATA4 otherwise.
    • OPCODE<...> is turned into a push using the relevant opcode.

    In particular (I think @ajtowns pointed this out before as well), using <...> for non-minimal but direct pushes is fine, as it cannot represent anything else (not an OP_n, because those use decimal, and not a PUSHDATA opcode because if those were used for sizes <= 75 bytes, they’d have used OPCODE<...>).


    ajtowns commented at 2:40 am on November 17, 2023:
    Yeah. This just means changing the final catch to return BracketStr(HexStr(vch));, no?

    ajtowns commented at 2:41 am on November 17, 2023:

    Also,

    0if (!(vch.size() <= 5) ) {
    

    Can I introduce you to > ? :smile:

  38. in src/core_write.cpp:140 in 89547bb05f outdated
    155+                    // Return decimal for minimially-encoded values <= 5 bytes (OP_CLTV / OP_CSV accept 5-byte numbers)
    156+                    CScriptNum n(vch, false, 5);
    157+                    str += strprintf("%lld", n.GetInt64());
    158                 } else {
    159-                    str += HexStr(vch);
    160+                    // Otherwise display the push as LE hex enclosed in angle brackets
    


    sipa commented at 0:07 am on November 16, 2023:
    I expect this to be a controversial opinion, but I disagree with calling this “little endian”. That’s a term that applies to encoding numbers to bytes/bits. Since what’s being encoded isn’t a number, or to be interpreted as a number, endianness is inapplicable. It’s just encoding bytes in hexadecimal format, in order.

    luke-jr commented at 0:11 am on November 16, 2023:
    It’s not little endian, I agree.
  39. sipa commented at 1:35 pm on November 17, 2023: member

    Brainstormy idea for undecodable scripts.

    These are necessarily pushes (direct or OP_PUSHDATA[124]) whose prescribed length exceeds the end of the script. What about using <...+N> or OPCODE<...+N>, where ... is hexadecimal as before, and N is a decimal number indicating how many bytes are missing?

    EDIT: for OP_PUSHDATA[124] it’s also possible that the 1-4 byte length field itself straddles the script end, and that isn’t covered by this suggestion.

  40. willcl-ark commented at 2:06 pm on November 17, 2023: member

    Brainstormy idea for undecodable scripts.

    Currently failure to fully parse in GetScriptOp will only return false if we run out of bytes to read either the length, or from the operand, and we don’t store the opcode or the available remaining bytes from the iterator in opcodeRet or pvchRet. But we could change GetScriptOpt to store them the failure case too.

    <...+N> or OPCODE<...+N>

    I don’t think it makes much difference which flavour is chosen here; both would require the N to be parsed to calculate the total length and therefore determine the opcode that was used before it was encoded. As there are bytes missing, I don’t think we can tell if this was a minimal push or not either. I’d probably slightly lean towards using OPCODE<...+N> as it’s slightly more human-friendly?

  41. sipa commented at 2:08 pm on November 17, 2023: member

    I’d probably slightly lean towards using OPCODE<...+N> as it’s slightly more human-friendly?

    In case it’s a direct push there is no opcode at all. I didn’t mean this as a flavor to choose from; we need both.

  42. ajtowns commented at 1:03 am on November 22, 2023: contributor

    Brainstormy idea for undecodable scripts.

    These are necessarily pushes (direct or OP_PUSHDATA[124]) whose prescribed length exceeds the end of the script. What about using <...+N> or OPCODE<...+N>, where ... is hexadecimal as before, and N is a decimal number indicating how many bytes are missing?

    EDIT: for OP_PUSHDATA[124] it’s also possible that the 1-4 byte length field itself straddles the script end, and that isn’t covered by this suggestion.

    If you’re adding extra encoding that need to be parsed (+N), maybe just have UNPARSABLE<hex> or GARBAGE<hex> to capture the trailing data that wasn’t a full push? Seems less likely to be confusing (easy to miss a +30 at the end of a hex string), and works even if we change an OP_SUCCESS into a multibyte opcode in future?

  43. sipa commented at 1:08 am on November 22, 2023: member

    Yeah, my only (rather minor) reservation about UNPARSEABLE<...> is that I think of <...> as a way of indicating “stuff being pushed”, which wouldn’t be the case here. And if we’re going to introduce some special syntax for unparseable stuff anyway, I thought it might be useful to integrate into the push syntax itself (as everything unparseable is an incomplete push).

    That said, I agree that the +N thing isn’t super readabable, and it’s incomplete anyway (can’t deal with OP_PUSHDATA[124] whose length descriptor itself is missing).

    So maybe just UNPARSEABLE(...)?

  44. ajtowns commented at 3:58 am on November 22, 2023: contributor

    Yeah, my only (rather minor) reservation about UNPARSEABLE<...> is that I think of <...> as a way of indicating “stuff being pushed”,

    Yeah, I guess being able to scan and say <...> means data is going onto the stack makes sense.

    So maybe just UNPARSEABLE(...)?

    Could do UNPARSEABLE:04010203 or something (colon instead of brackets).

    (Arguably, everything after the first OP_SUCCESS in a tapscript could also be considered trailing hex that shouldn’t be decoded. Probably more useful to decode it anyway, though)

  45. DrahtBot added the label CI failed on Jan 16, 2024
  46. Luisjdj approved
  47. maflcko commented at 7:51 am on March 11, 2024: member
    Are you still working on this?
  48. script: don't decode sighash flags in ASM repr
    We don't decode these on taproot signatures already, modify non-taproot
    signatures to match.
    2bf4085ab0
  49. willcl-ark force-pushed on Mar 12, 2024
  50. script: add GetOpNameAsm helper fn
    Strip unneccesary leading "OP_" prefix from opcodes when they are
    rendered as ASM.
    1c3468e77e
  51. willcl-ark force-pushed on Mar 12, 2024
  52. willcl-ark renamed this:
    Use LE hex-encoded representations in script ASM for pushed values <= 4 bytes
    Fix ASM ambiguity
    on Mar 12, 2024
  53. DrahtBot removed the label CI failed on Mar 13, 2024
  54. script: drop OP_ prefix from ASM repr 88b6115eca
  55. test: add asm repr asm_hex helper fn
    A simple helper function to angle bracket a hex string.
    7695272fcd
  56. script: remove ambiguity from ASM repr
    * Return minimal pushes < 5 bytes as decimal
    * Return non-minimal pushes < 5 bytes wrapped as OP_CODE<...>
    * Return pushes > 5 bytes as hex wrapped as <...>
    614d25c56d
  57. script: return undecodable bytes b3a8c40571
  58. doc: release note for asm repr script changes 4890626156
  59. willcl-ark force-pushed on Mar 13, 2024
  60. sipa commented at 2:17 pm on May 8, 2024: member
    Attempt to revive the discussion: #27795 (comment)
  61. DrahtBot added the label Needs rebase on Jul 23, 2024
  62. DrahtBot commented at 8:33 pm on July 23, 2024: contributor

    🐙 This pull request conflicts with the target branch and needs rebase.

  63. DrahtBot commented at 1:01 am on October 20, 2024: contributor

    ⌛ There hasn’t been much activity lately and the patch still needs rebase. What is the status here?

    • Is it still relevant? ➡️ Please solve the conflicts to make it ready for review and to ensure the CI passes.
    • Is it no longer relevant? ➡️ Please close.
    • Did the author lose interest or time to work on this? ➡️ Please close it and mark it ‘Up for grabs’ with the label, so that it can be picked up in the future.
  64. Madxgame approved

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-23 18:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me