Remove Ambiguity of Script ASM Hex and Decimal Integer Representations #27795

issue MatthewLM openend this issue on May 31, 2023
  1. MatthewLM commented at 11:14 am on May 31, 2023: none

    Please describe the feature you’d like to see added.

    Distinguish between decimal and hex integers in the script ASM to avoid ambiguity. Hex integers/data can be prefixed with 0x to avoid ambiguity.

    When scripts are decoded into ASM, two different integers can be displayed identically, with one as hex and the other as decimal. decodescript 0511121314150457c74942 produces ASM of 1112131415 1112131415 , despite the integers being different. There is no distinction between hex and decimal.

    Describe the solution you’d like

    Prefixing 0x makes the most sense to me.

    Describe any alternatives you’ve considered

    All data could be provided as hex, or “d” could be added to decimals. Removing ambiguity is the primary concern here.

    Please leave any additional context

    No response

  2. MatthewLM added the label Feature on May 31, 2023
  3. maflcko added the label RPC/REST/ZMQ on May 31, 2023
  4. fanquake commented at 12:49 pm on May 31, 2023: member
    Related to #7996.
  5. darosior commented at 2:04 pm on June 1, 2023: member

    When scripts are decoded into ASM, two different integers can be displayed identically, with one as hex and the other as decimal.

    It’s not two different integers, it’s that integers (which all <=4 bytes pushes are assumed to be) are displayed in decimal and all other pushes are displayed as hexadecimal.

    I wonder why we even try to detect and print numbers differently (and even more so in a different base!), just print all pushes in hex and there is no ambiguity anymore?

  6. MatthewLM commented at 2:21 pm on June 1, 2023: none

    They are two different numbers. One is 0x1112131415 and the other is 0x57c74942 (encoded as little-endian). In ASM, they are displayed the same. Using 0x before hex is consistent with hex literals elsewhere and would allow for decimals to be displayed including -1 for OP_1NEGATE.

    Displaying all numbers as hex is an option. One problem is that they are given as little-endian which could be confusing for some wishing to interpret them. Off the top of my memory, the scripting interpreter treats 32-bit numbers arithmetically so I think it makes sense to display those separately. The only difference should be that hex and decimal should be distinguished.

  7. darosior commented at 2:28 pm on June 1, 2023: member

    Sure, i misspoke. I just meant:

    0>>> int.from_bytes(bytes.fromhex("57c74942"), "little")
    11112131415
    
  8. willcl-ark commented at 8:59 am on June 2, 2023: member

    It’s a little unclear to me how the ASM output is used on the consumer side, and therefore how breaking any changes to it might be. However I still think we have a few options to resolve this:

    1. Fully hex-encoded output in the ASM repr:
    0$ /home/will/src/bitcoin/src/bitcoin-cli -regtest decodescript 0511121314150457c74942
    1{
    2  "asm": "1112131415 57c74942",
    
    1. Prefixing hex values with 0x (we would end up with a lot of prefixes in this case):
    0$ /home/will/src/bitcoin/src/bitcoin-cli -regtest decodescript 0511121314150457c74942
    1{
    2  "asm": "0x1112131415 1112131415",
    
    1. Prefixing decimal values with 0d (only prefixing small pushes):
    0$ /home/will/src/bitcoin/src/bitcoin-cli -regtest decodescript 0511121314150457c74942
    1{
    2  "asm": "1112131415 0d1112131415",
    

    As I think Antoine inticated, I’d have a preference for “full hex”. However I think there may be a small quality of life advantage to prefixing decimal values for persons manually crafting or reading scripts using ASM notation, as e.g. timelock values are most easily thought about in decimal.

    Edit: Curious if anyone knows the original rationale for displaying small values as decimal, it was carved out by satoshi.

  9. willcl-ark referenced this in commit ce5f9455f2 on Jun 5, 2023
  10. MatthewLM commented at 12:37 pm on June 5, 2023: none

    Edit: Curious if anyone knows the original rationale for displaying small values as decimal, it was carved out by satoshi.

    It is because any numerical/arithmetic operations only allow 32-bit values (please see the CScriptNum class), so the ASM displays 32-bit values as decimal and longer values as hex. This makes sense as humans generally better comprehend decimal values when thinking arithmetically, but the values need to be distinguishable.

  11. willcl-ark commented at 2:11 pm on June 5, 2023: member

    Well, they would still be 32 bit values, just hex-encoded? But I take the point. I personally think prefixing hex with 0x would be kind of horrible, as this would cascade into including pubkeys for example.

    Is prefixing decimals with 0d the best approach then?

    I am still unclear on how much impact this would have downstream, and therefore whether any of these discussed changes would be accepted in any form. It seems like this could break much of the little scripting tooling we have available for bitcoin…

  12. MatthewLM commented at 3:20 pm on June 5, 2023: none

    Is it a problem prefixing public keys with 0x? I’m not overly fussed on the solution as long as the ambiguity is removed.

    It seems like this could break much of the little scripting tooling we have available for bitcoin

    If these tools read these ambiguous numbers/data from ASM, then they are already broken, so this needs fixing one way or another.

  13. ajtowns commented at 7:03 am on August 25, 2023: contributor

    Maybe https://github.com/ajtowns/bitcoin/commit/36cbf115bda9fb2a28c1f0ec119d6108218389c9 ? It adds a “0x” if the hex is ambiguous (ie none of the digits are a-f). Presumably almost all pubkeys will not be ambiguous (1-in-12-trillion chance) so the 0x won’t get added there and confuse people.

    0$ ./bitcoin-cli -regtest decodescript 0511121314150457c74942 | jq .asm
    1"0x1112131415 1112131415"
    2$ ./bitcoin-cli -regtest decodescript 0511121c14150457c74942 | jq .asm
    3"11121c1415 1112131415"
    

    Another approach could be to add the prefix for hex strings of 5 bytes (10 hex digits) – anything that would decode to 8 or fewer hex digits would be decoded as decimal anyway, and anything with >11 decimal digits would be more than 4 bytes originally so would not have been decoded as decimal.

  14. willcl-ark referenced this in commit 8cc430b8b8 on Nov 8, 2023
  15. willcl-ark referenced this in commit d6c0c4bef6 on Nov 8, 2023
  16. luke-jr commented at 9:56 pm on November 8, 2023: member
    Even aside from the main issue, it’s still ambiguous for the various opcode possibilities… :/
  17. luke-jr referenced this in commit 08ab70da3c on Nov 9, 2023
  18. sipa commented at 2:16 pm on May 8, 2024: member

    The discussion here led to #28824 being opened, where more discussion happened, and the idea evolved further. Having re-read things there, I have a concrete proposal that perhaps goes a bit further, so I’m posting it here to keep PR discussion about the implementation.

    The goal is addressing ambiguity and readability, but not necessarily compatibility (because being compatibility with something broken is silly).

    Decoding rules

    Even if all we expose on the RPC/user side is encoding, I believe it’s useful to formally specify the decoding rules, as they help understand how the encoding can be unambiguous. If this is implemented, I think we want the decoder written too, so that e.g. round-trip fuzz tests can be added for the encoding.

    • If the asm string in its entirety is an even length hex string, it is decoded raw (i.e. script.size() == asm.size() / 2 exactly).
    • Otherwise, the script is interpreted as consisting of 1 or more whitespace-separated units, which are each decoded as listed below and concatenated:
      • An opcode name, excluding OP_PUSHDATAn, with or without OP_ prefix, is decoded to the single opcode byte.
      • A decimal number, optionally prefixed with + or -, is decoded as the script consisting of the minimum push of the minimal encoding of that number (i.e. OP_1NEGATE, OP_0 through OP_16, or direct push), assuming that encoding is no more than 5 bytes. For 0 through 16 this overlaps with the previous rule, but since they mean the same thing this is ok.
      • A string surrounded by < and > means a direct push or OP_PUSHDATAx push (whichever is smaller) of the recursively-decoded string between angle brackets. So that string can be just hex data, or it can be another sequence of units.
      • A string surrounded by PUSHDATAx< and > means a push using the specified OP_PUSHDATAx push of the recursively-decoded string between angle brackets.
      • An even length hex string prefixed with “#” means raw hex decoding that string.

    Notably, this enables recursive encoding (e.g. a 2-of-3 multisig P2SH scriptSig could become 0 <sig1> <sig3> <2 <pub1> <pub2> <pub3> CHECKMULTISIG>) .

    The ambiguity of asm scripts that consists entirely of an even number of decimal characters is resolved in favor of treating them as raw hex rather than as a decimal integer, because:

    • Scripts that entirely consist of a single push are uninteresting and presumably rare (they’re anyone-can-spend).
    • This means that the rules for decoding what’s inside <> are exactly the same as decoding scripts.

    Compared to the earlier discussion, it uses # instead of UNPARSABLE:, GARBAGE:, RAW() or any of those. Happy to bikeshed this; it shouldn’t occur frequently.

    It drops 0x things entirely, because I think the ones we have in the asm format currently are nonsensical (sometimes they are raw bytes, sometimes they are pushes, but there they use raw byte order while 0x to me would generally mean human-ordered (big endian) numbers (as they do in C and Python). If people feel compatibility with the 0x stuff is desirable, it could be added back, but I’d rather avoid it. Alternatively, 0x for numeric things would make more sense to me (e.g. 0x12345 meaning the script whose hex bytes are “03 45 23 01”), but that’s an even bigger break with what we had before.

    Encoding rules

    The format allows multiple possible encodings for the same thing (e.g. script byte 00 aka OP_0 can be written as <>, 0, OP_0, or #00), so the encoding is not unique. This is a suggested approach:

    • For pushes, try in order:
      • If just a decimal integer encodes the right thing, use that. This is always the case for (among others) OP_1NEGATE, and OP_0 through OP_16. A + or OP_ in front is only needed if the entire script consists of just a single push whose decimal encoding would have even length otherwise (because confusion with entirely hex-encoded string rule).
      • If the push is minimal otherwise, use < + hex encoding of the pushed data + >. If the entire script is push-only, and the last push looks like a script itself, use < + disassembly of the pushed script + > instead.
      • Use PUSHDATAx< + hex encoding of pushed data + > otherwise.
    • Other opcodes are just turned into their name.
    • Unparsable data becomes # + raw hex encoding.

    If everything, or a large fraction of the script, is unparsable just hex encode instead.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-09-28 22:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me