Remove Ambiguity of Script ASM Hex and Decimal Integer Representations

MatthewLM commented at 11:14 am on May 31, 2023: none

Please describe the feature you’d like to see added.

Distinguish between decimal and hex integers in the script ASM to avoid ambiguity. Hex integers/data can be prefixed with 0x to avoid ambiguity.

When scripts are decoded into ASM, two different integers can be displayed identically, with one as hex and the other as decimal. decodescript 0511121314150457c74942 produces ASM of 1112131415 1112131415 , despite the integers being different. There is no distinction between hex and decimal.

Describe the solution you’d like

Prefixing 0x makes the most sense to me.

Describe any alternatives you’ve considered

All data could be provided as hex, or “d” could be added to decimals. Removing ambiguity is the primary concern here.

Please leave any additional context

No response

MatthewLM added the label Feature on May 31, 2023

maflcko added the label RPC/REST/ZMQ on May 31, 2023

fanquake commented at 12:49 pm on May 31, 2023: member

Related to #7996.

darosior commented at 2:04 pm on June 1, 2023: member

When scripts are decoded into ASM, two different integers can be displayed identically, with one as hex and the other as decimal.

~~It’s not two different integers,~~ it’s that integers (which all <=4 bytes pushes are assumed to be) are displayed in decimal and all other pushes are displayed as hexadecimal.

I wonder why we even try to detect and print numbers differently (and even more so in a different base!), just print all pushes in hex and there is no ambiguity anymore?

MatthewLM commented at 2:21 pm on June 1, 2023: none

They are two different numbers. One is 0x1112131415 and the other is 0x57c74942 (encoded as little-endian). In ASM, they are displayed the same. Using 0x before hex is consistent with hex literals elsewhere and would allow for decimals to be displayed including -1 for OP_1NEGATE.

Displaying all numbers as hex is an option. One problem is that they are given as little-endian which could be confusing for some wishing to interpret them. Off the top of my memory, the scripting interpreter treats 32-bit numbers arithmetically so I think it makes sense to display those separately. The only difference should be that hex and decimal should be distinguished.

darosior commented at 2:28 pm on June 1, 2023: member

Sure, i misspoke. I just meant:

0>>> int.from_bytes(bytes.fromhex("57c74942"), "little")
11112131415

willcl-ark commented at 8:59 am on June 2, 2023: member

It’s a little unclear to me how the ASM output is used on the consumer side, and therefore how breaking any changes to it might be. However I still think we have a few options to resolve this:

Fully hex-encoded output in the ASM repr:

0$ /home/will/src/bitcoin/src/bitcoin-cli -regtest decodescript 0511121314150457c74942
1{
2  "asm": "1112131415 57c74942",

Prefixing hex values with 0x (we would end up with a lot of prefixes in this case):

0$ /home/will/src/bitcoin/src/bitcoin-cli -regtest decodescript 0511121314150457c74942
1{
2  "asm": "0x1112131415 1112131415",

Prefixing decimal values with 0d (only prefixing small pushes):

0$ /home/will/src/bitcoin/src/bitcoin-cli -regtest decodescript 0511121314150457c74942
1{
2  "asm": "1112131415 0d1112131415",

As I think Antoine inticated, I’d have a preference for “full hex”. However I think there may be a small quality of life advantage to prefixing decimal values for persons manually crafting or reading scripts using ASM notation, as e.g. timelock values are most easily thought about in decimal.

Edit: Curious if anyone knows the original rationale for displaying small values as decimal, it was carved out by satoshi.

willcl-ark referenced this in commit ce5f9455f2 on Jun 5, 2023

MatthewLM commented at 12:37 pm on June 5, 2023: none

Edit: Curious if anyone knows the original rationale for displaying small values as decimal, it was carved out by satoshi.

It is because any numerical/arithmetic operations only allow 32-bit values (please see the CScriptNum class), so the ASM displays 32-bit values as decimal and longer values as hex. This makes sense as humans generally better comprehend decimal values when thinking arithmetically, but the values need to be distinguishable.

willcl-ark commented at 2:11 pm on June 5, 2023: member

Well, they would still be 32 bit values, just hex-encoded? But I take the point. I personally think prefixing hex with 0x would be kind of horrible, as this would cascade into including pubkeys for example.

Is prefixing decimals with 0d the best approach then?

I am still unclear on how much impact this would have downstream, and therefore whether any of these discussed changes would be accepted in any form. It seems like this could break much of the little scripting tooling we have available for bitcoin…

MatthewLM commented at 3:20 pm on June 5, 2023: none

Is it a problem prefixing public keys with 0x? I’m not overly fussed on the solution as long as the ambiguity is removed.

It seems like this could break much of the little scripting tooling we have available for bitcoin

If these tools read these ambiguous numbers/data from ASM, then they are already broken, so this needs fixing one way or another.

ajtowns commented at 7:03 am on August 25, 2023: contributor

Maybe https://github.com/ajtowns/bitcoin/commit/36cbf115bda9fb2a28c1f0ec119d6108218389c9 ? It adds a “0x” if the hex is ambiguous (ie none of the digits are a-f). Presumably almost all pubkeys will not be ambiguous (1-in-12-trillion chance) so the 0x won’t get added there and confuse people.

0$ ./bitcoin-cli -regtest decodescript 0511121314150457c74942 | jq .asm
1"0x1112131415 1112131415"
2$ ./bitcoin-cli -regtest decodescript 0511121c14150457c74942 | jq .asm
3"11121c1415 1112131415"

Another approach could be to add the prefix for hex strings of 5 bytes (10 hex digits) – anything that would decode to 8 or fewer hex digits would be decoded as decimal anyway, and anything with >11 decimal digits would be more than 4 bytes originally so would not have been decoded as decimal.

willcl-ark referenced this in commit 8cc430b8b8 on Nov 8, 2023

willcl-ark referenced this in commit d6c0c4bef6 on Nov 8, 2023

luke-jr commented at 9:56 pm on November 8, 2023: member

Even aside from the main issue, it’s still ambiguous for the various opcode possibilities… :/

luke-jr referenced this in commit 08ab70da3c on Nov 9, 2023

sipa commented at 2:16 pm on May 8, 2024: member

The discussion here led to #28824 being opened, where more discussion happened, and the idea evolved further. Having re-read things there, I have a concrete proposal that perhaps goes a bit further, so I’m posting it here to keep PR discussion about the implementation.

The goal is addressing ambiguity and readability, but not necessarily compatibility (because being compatibility with something broken is silly).

Decoding rules

Even if all we expose on the RPC/user side is encoding, I believe it’s useful to formally specify the decoding rules, as they help understand how the encoding can be unambiguous. If this is implemented, I think we want the decoder written too, so that e.g. round-trip fuzz tests can be added for the encoding.

If the asm string in its entirety is an even length hex string, it is decoded raw (i.e. script.size() == asm.size() / 2 exactly).
Otherwise, the script is interpreted as consisting of 1 or more whitespace-separated units, which are each decoded as listed below and concatenated:
- An opcode name, excluding OP_PUSHDATAn, with or without OP_ prefix, is decoded to the single opcode byte.
- A decimal number, optionally prefixed with + or -, is decoded as the script consisting of the minimum push of the minimal encoding of that number (i.e. OP_1NEGATE, OP_0 through OP_16, or direct push), assuming that encoding is no more than 5 bytes. For 0 through 16 this overlaps with the previous rule, but since they mean the same thing this is ok.
- A string surrounded by < and > means a direct push or OP_PUSHDATAx push (whichever is smaller) of the recursively-decoded string between angle brackets. So that string can be just hex data, or it can be another sequence of units.
- A string surrounded by PUSHDATAx< and > means a push using the specified OP_PUSHDATAx push of the recursively-decoded string between angle brackets.
- An even length hex string prefixed with “#” means raw hex decoding that string.

Notably, this enables recursive encoding (e.g. a 2-of-3 multisig P2SH scriptSig could become 0 <sig1> <sig3> <2 <pub1> <pub2> <pub3> CHECKMULTISIG>) .

The ambiguity of asm scripts that consists entirely of an even number of decimal characters is resolved in favor of treating them as raw hex rather than as a decimal integer, because:

Scripts that entirely consist of a single push are uninteresting and presumably rare (they’re anyone-can-spend).
This means that the rules for decoding what’s inside <> are exactly the same as decoding scripts.

Compared to the earlier discussion, it uses # instead of UNPARSABLE:, GARBAGE:, RAW() or any of those. Happy to bikeshed this; it shouldn’t occur frequently.

It drops 0x things entirely, because I think the ones we have in the asm format currently are nonsensical (sometimes they are raw bytes, sometimes they are pushes, but there they use raw byte order while 0x to me would generally mean human-ordered (big endian) numbers (as they do in C and Python). If people feel compatibility with the 0x stuff is desirable, it could be added back, but I’d rather avoid it. Alternatively, 0x for numeric things would make more sense to me (e.g. 0x12345 meaning the script whose hex bytes are “03 45 23 01”), but that’s an even bigger break with what we had before.

Encoding rules

The format allows multiple possible encodings for the same thing (e.g. script byte 00 aka OP_0 can be written as <>, 0, OP_0, or #00), so the encoding is not unique. This is a suggested approach:

For pushes, try in order:
- If just a decimal integer encodes the right thing, use that. This is always the case for (among others) OP_1NEGATE, and OP_0 through OP_16. A + or OP_ in front is only needed if the entire script consists of just a single push whose decimal encoding would have even length otherwise (because confusion with entirely hex-encoded string rule).
- If the push is minimal otherwise, use < + hex encoding of the pushed data + >. If the entire script is push-only, and the last push looks like a script itself, use < + disassembly of the pushed script + > instead.
- Use PUSHDATAx< + hex encoding of pushed data + > otherwise.
Other opcodes are just turned into their name.
Unparsable data becomes # + raw hex encoding.

If everything, or a large fraction of the script, is unparsable just hex encode instead.

ajtowns commented at 1:05 am on May 28, 2025: contributor

I’ve had a go at implementing the above approach; code is at https://github.com/ajtowns/bitcoin/commits/202505-asmdecode/

For testing purposes I’ve changed the decodescript rpc to accept assembly as input, since that’s now a superset of hex encoding. I’ve also done the round-trip fuzzing in the script -> assembly -> script direction; it seems to work so far.

Examples:

 0$ bitcoin-cli -regtest decodescript '51204af12dcd91328acf048740f90666e4f4a88d809f0882ad6f2b4440ddae78cb43' | jq -r .asm
 11 <4af12dcd91328acf048740f90666e4f4a88d809f0882ad6f2b4440ddae78cb43>
 2
 3# from random p2sh pend (txid ac3efa70cd8dfe548d727657e4b56704685fad6a43d934351c88940502ba3bf8)
 4$ bitcoin-cli -regtest decodescript '004830450221009e9bac3a1b91269ff15ff062d0c53779497d563fbb4d360c91504fc3b4199cb1022044caa0a4109c0d4539b7d8331c2f8b4ccba2072e59159b805e827942d55da0110147304402205d16c2f9b0ae72ca5a844b68fa0b0e9f9d188763486808dd526c23966aa43ebe022077e0bbeea907989461236654769b2df8e5aecf19b899a424cc4d09a3902ef9ed014c69522103a34cd2fd1273750453fde17922ea04064292092b8530402e047cc82e60ad9ad4210275738d4b698995c5d4252353d0855d6223d1772a506fe30b3f2b10e43d8307702103fae400923e16776836a075ef27fcb2c63f79be14fd3d48d13f7f14915d87f96053ae' | jq -r .asm
 50 <30450221009e9bac3a1b91269ff15ff062d0c53779497d563fbb4d360c91504fc3b4199cb1022044caa0a4109c0d4539b7d8331c2f8b4ccba2072e59159b805e827942d55da01101> <304402205d16c2f9b0ae72ca5a844b68fa0b0e9f9d188763486808dd526c23966aa43ebe022077e0bbeea907989461236654769b2df8e5aecf19b899a424cc4d09a3902ef9ed01> <522103a34cd2fd1273750453fde17922ea04064292092b8530402e047cc82e60ad9ad4210275738d4b698995c5d4252353d0855d6223d1772a506fe30b3f2b10e43d8307702103fae400923e16776836a075ef27fcb2c63f79be14fd3d48d13f7f14915d87f96053ae>
 6
 7# what's the redeemscript?
 8$ bitcoin-cli -regtest decodescript '522103a34cd2fd1273750453fde17922ea04064292092b8530402e047cc82e60ad9ad4210275738d4b698995c5d4252353d0855d6223d1772a506fe30b3f2b10e43d8307702103fae400923e16776836a075ef27fcb2c63f79be14fd3d48d13f7f14915d87f96053ae' | jq -r .asm
 92 <03a34cd2fd1273750453fde17922ea04064292092b8530402e047cc82e60ad9ad4> <0275738d4b698995c5d4252353d0855d6223d1772a506fe30b3f2b10e43d830770> <03fae400923e16776836a075ef27fcb2c63f79be14fd3d48d13f7f14915d87f960> 3 OP_CHECKMULTISIG
10
11# full source decode
12$ bitcoin-cli -regtest decodescript '0 <30450221009e9bac3a1b91269ff15ff062d0c53779497d563fbb4d360c91504fc3b4199cb1022044caa0a4109c0d4539b7d8331c2f8b4ccba2072e59159b805e827942d55da01101> <304402205d16c2f9b0ae72ca5a844b68fa0b0e9f9d188763486808dd526c23966aa43ebe022077e0bbeea907989461236654769b2df8e5aecf19b899a424cc4d09a3902ef9ed01> <2 <03a34cd2fd1273750453fde17922ea04064292092b8530402e047cc82e60ad9ad4> <0275738d4b698995c5d4252353d0855d6223d1772a506fe30b3f2b10e43d830770> <03fae400923e16776836a075ef27fcb2c63f79be14fd3d48d13f7f14915d87f960> 3 OP_CHECKMULTISIG>' | jq -r .asm
130 <30450221009e9bac3a1b91269ff15ff062d0c53779497d563fbb4d360c91504fc3b4199cb1022044caa0a4109c0d4539b7d8331c2f8b4ccba2072e59159b805e827942d55da01101> <304402205d16c2f9b0ae72ca5a844b68fa0b0e9f9d188763486808dd526c23966aa43ebe022077e0bbeea907989461236654769b2df8e5aecf19b899a424cc4d09a3902ef9ed01> <522103a34cd2fd1273750453fde17922ea04064292092b8530402e047cc82e60ad9ad4210275738d4b698995c5d4252353d0855d6223d1772a506fe30b3f2b10e43d8307702103fae400923e16776836a075ef27fcb2c63f79be14fd3d48d13f7f14915d87f96053ae>