doc: Correct uint256 hex string endianness

hodlinator commented at 7:19 pm on July 25, 2024: contributor

This PR is a follow-up to #30436.

Only changes test-code and modifies/adds comments.

Byte order of hex string representation was wrongfully documented as little-endian, but are in fact closer to “big-endian” (endianness is a memory-order concept rather than a numeric concept). [arith_]uint256 both store their data in arrays with little-endian byte order (arith_uint256 has host byte order within each uint32_t element).

uint256_tests.cpp - Avoid using variable from the left side of the condition in the right side. Credits to @maflcko: #30436 (review)

setup_common.cpp - Skip needless ArithToUint256-conversion. Credits to @stickies-v: #30436 (review)

Logical reasoning for endianness

Comparing an arith_uint256 (base_uint<256>) to a uint64_t compares the beginning of the array, and verifies the remaining elements are zero.

 0template <unsigned int BITS>
 1bool base_uint<BITS>::EqualTo(uint64_t b) const
 2{
 3    for (int i = WIDTH - 1; i >= 2; i--) {
 4        if (pn[i])
 5            return false;
 6    }
 7    if (pn[1] != (b >> 32))
 8        return false;
 9    if (pn[0] != (b & 0xfffffffful))
10        return false;
11    return true;
12}

…that is consistent with little endian ordering of the array.

They have the same endianness (but arith_* has host-ordering of each uint32_t element):

0arith_uint256 UintToArith256(const uint256 &a)
1{
2    arith_uint256 b;
3    for(int x=0; x<b.WIDTH; ++x)
4        b.pn[x] = ReadLE32(a.begin() + x*4);
5    return b;
6}

String conversions

The reversal of order which happens when converting hex-strings <=> uint256 means strings are actually closer to big-endian, see the end of base_blob<BITS>::SetHexDeprecated:

0    unsigned char* p1 = m_data.data();
1    unsigned char* pend = p1 + WIDTH;
2    while (digits > 0 && p1 < pend) {
3        *p1 = ::HexDigit(trimmed[--digits]);
4        if (digits > 0) {
5            *p1 |= ((unsigned char)::HexDigit(trimmed[--digits]) << 4);
6            p1++;
7        }
8    }

Same reversal here:

0template <unsigned int BITS>
1std::string base_blob<BITS>::GetHex() const
2{
3    uint8_t m_data_rev[WIDTH];
4    for (int i = 0; i < WIDTH; ++i) {
5        m_data_rev[i] = m_data[WIDTH - 1 - i];
6    }
7    return HexStr(m_data_rev);
8}

It now makes sense to me that SetHexDeprecated, upon receiving a shorter hex string that requires zero-padding, would pad as if the missing hex chars where towards the end of the little-endian byte array, as they are the most significant bytes. “Big-endian” string representation is also consistent with the case where SetHexDeprecated receives too many hex digits and discards the leftmost ones, as a form of integer narrowing takes place.

How I got it wrong in #30436

Previously I used the less than (<) comparison to prove endianness, but for uint256 it uses memcmp and thereby gives priority to the lower bytes at the beginning of the array.

0    constexpr int Compare(const base_blob& other) const { return std::memcmp(m_data.data(), other.m_data.data(), WIDTH); }

arith_uint256 is different in that it begins by comparing the bytes from the end, as it is using little endian representation, where the bytes toward the end are more significant.

 0template <unsigned int BITS>
 1int base_uint<BITS>::CompareTo(const base_uint<BITS>& b) const
 2{
 3    for (int i = WIDTH - 1; i >= 0; i--) {
 4        if (pn[i] < b.pn[i])
 5            return -1;
 6        if (pn[i] > b.pn[i])
 7            return 1;
 8    }
 9    return 0;
10}

(The commit documents that base_blob::Compare() is doing lexicographic ordering unlike the arith_*-variant which is doing numeric ordering).

DrahtBot commented at 7:19 pm on July 25, 2024: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	paplorinc, ryanofsky

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#30560 (refactor: Add consteval uint256 constructor by hodlinator)
#30377 (refactor: Add consteval uint256{“str”} by hodlinator)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

DrahtBot added the label Docs on Jul 25, 2024

hodlinator force-pushed on Jul 26, 2024

in src/test/uint256_tests.cpp:182 in f01848fb03 outdated

175@@ -172,11 +176,11 @@ BOOST_AUTO_TEST_CASE(methods) // GetHex SetHexDeprecated FromHex begin() end() s
176     BOOST_CHECK_EQUAL(uint256::FromHex(ZeroL.ToString()).value(), uint256());
177 
178     TmpL = uint256::FromHex(R1L.ToString()).value();
179-    BOOST_CHECK_EQUAL_COLLECTIONS(R1L.begin(), R1L.end(), R1Array, R1Array + R1L.size());
180-    BOOST_CHECK_EQUAL_COLLECTIONS(TmpL.begin(), TmpL.end(), R1Array, R1Array + TmpL.size());
181-    BOOST_CHECK_EQUAL_COLLECTIONS(R2L.begin(), R2L.end(), R2Array, R2Array + R2L.size());
182-    BOOST_CHECK_EQUAL_COLLECTIONS(ZeroL.begin(), ZeroL.end(), ZeroArray, ZeroArray + ZeroL.size());
183-    BOOST_CHECK_EQUAL_COLLECTIONS(OneL.begin(), OneL.end(), OneArray, OneArray + OneL.size());
184+    BOOST_CHECK_EQUAL_COLLECTIONS(R1L.begin(), R1L.end(), R1Array, R1Array + uint256::size());

l0rinc commented at 6:24 pm on July 26, 2024:

👍 , if possible, we should do the *_EQUAL changes in the arith_uint256_tests as well (probably best in a different PR).

hodlinator commented at 1:46 pm on August 1, 2024:

Have a similar nagging feeling. Happy to review a future PR that does that. If it only does that type of change maybe there are even more tests that could benefit.

sipa commented at 2:08 pm on July 27, 2024: member

I’m not convinced about this change. The current code does have a number of confusing comments, but I think this confuses some things further.

“Big endian” and “little endian” describe ways to encode/decode numbers as bytes. Byte arrays themselves do not have an endianness (only their interpretation as number does), and it’s even ambiguous to apply the term to hex encodings of bytes. Given an array of bytes being converted to hexadecimal, we can imagine:

A. Those bytes being interpreted as a number (according to some endianness), and then that number is converted to hexadecimal for human consumption (typically most significant hex character first).
B. Those bytes being interpreted as a sequence of single-byte numbers, which are individually converted to hex characters (typically most significant hex character first) and just concatenated.

The hex format used for uint256 could be described as:

(A) Converting to a number by interpreting as little-endian bytes, and then converting that number to hex with most significant character first.
The bytes are reversed, and then (B) individually converted to hex with most significant character first and concatenating.
(B) individually converting the bytes to hex with least significant character first and concatenating, followed by reversing the characters.

None of these descriptions are what I imagine when I read “big-endian hex”.

I believe the most straightforward description is to treat uint256 (despite its name) not as representing a number at all, but as a simple wrapper around 32-byte arrays (if we’d move some logic of it to separate functions, I think it would be totally reasonable to replace it with using uint256 = std::array<std::byte, 32> even), and endianness is then just an inapplicable term.

The hex encoding for uint256 I would then describe as “byte-reversed hex encoding” (perhaps with an explainer “matching the interpretation of the bytes in little-endian as arith_uint256 does, followed by converting that to hex”).
The comparison between uint256 would then be best described as “lexicographic ordering (note: this does not match the ordering on the corresponding arith_uint256 interpretation)”.

On the other hand, arith_uint256 actually does implement a big integer type, which means:

Its byte encoding can properly be described as little-endian.
Its hex encoding can be described as “hex encoding of the number (with the most significant characters first)”.
Its comparison is numeric comparison.

l0rinc commented at 2:40 pm on July 27, 2024: contributor

Thanks for your detailed reasoning, @sipa!

Documenting the endianness was my idea, and I appreciate your insights. I will investigate your claims in more detail later as I am currently traveling.

l0rinc commented at 10:10 am on July 28, 2024: contributor

After further consideration and research, I believe I’ve identified a key point that may help clarify our discussion about endianness in the codebase. We seem to agree that the base16 to base256 conversion can indeed have an endianness. Based on the Wikipedia definition of endianness as “the order in which bytes within a word of digital data are transmitted over a data communication medium or addressed (by rising addresses) in computer memory, counting only byte significance compared to earliness”, the resulting bytes (representing any data, especially if it’s fixed size) can also be stored forward or backward, which aligns with this concept of significance compared to earliness. So according to this, arrays can also have endianness.

Currently, it seems we have a form of double endian encoding in our codebase, where:

The base16 to base256 conversion has an endianness.
The storage of the resulting bytes can also reflect endianness.

This double encoding might be the root of why we viewed this differently.

While I agree that your proposed approach could lead to more precise terminology here, upon digging into the codebase, it appears that endianness terms are used more broadly than your strict interpretation would suggest:

Endianness terminology used for hexadecimal representations:

Transaction IDs: https://github.com/bitcoin/bitcoin/blob/master/src/rpc/mining.cpp#L620
Hashes: https://github.com/bitcoin/bitcoin/blob/master/src/rpc/mining.cpp#L621

Endianness applied to non-numeric data:

uint256 treated as having specific endianness:

In policy code: https://github.com/bitcoin/bitcoin/blob/master/src/policy/packages.h#L93

It also doesn’t help that the uint256 getters (and testers like ArrayToString) themselves reverse the byte order before returning it (https://github.com/bitcoin/bitcoin/blob/master/src/uint256.cpp#L11-L18) and that reading uint64 chunks even considers the native endianness https://github.com/bitcoin/bitcoin/blob/master/src/uint256.h#L79.

The problem doesn’t seem to be as simple as just changing the terminology, it seems to be deeply ingrained in the codebase. I assume that we can’t actually simplify the code in most places (maybe for SipHash-type usages where the result is local and temporary), so we have to either refactor or document, right? How do you suggest we tackle this?

hodlinator force-pushed on Jul 29, 2024

hodlinator commented at 4:24 pm on July 29, 2024: contributor

Thank you for your feedback @paplorinc & @sipa!

I am convinced by sipa’s argument that endianness is primarily an in-memory representation order. Numbers like 4607 (= 0x11FF) do not have an endianness to them, neither should strings containing them, even though it is very tempting for hexadecimal.

07de45a882d43fd5567b2a6b7c8a2da3d6dbda3d contains much of sipa’s input. Not sure what to do with pre-existing code @paplorinc found that uses endianness terminology imprecisely.

Transaction IDs: https://github.com/bitcoin/bitcoin/blob/master/src/rpc/mining.cpp#L620

The RPC specifies txid as “transaction id encoded in little-endian hexadecimal”.

I dug into the behavior of the HashWriter used in CTransaction::ComputeHash() to compute txids:

 0diff --git a/src/test/hash_tests.cpp b/src/test/hash_tests.cpp
 1index 51f1d4c840..590af57ff6 100644
 2--- a/src/test/hash_tests.cpp
 3+++ b/src/test/hash_tests.cpp
 4@@ -2,6 +2,7 @@
 5 // Distributed under the MIT software license, see the accompanying
 6 // file COPYING or http://www.opensource.org/licenses/mit-license.php.
 7 
 8+#include "uint256.h"
 9 #include <clientversion.h>
10 #include <crypto/siphash.h>
11 #include <hash.h>
12@@ -146,6 +147,10 @@ BOOST_AUTO_TEST_CASE(siphash)
13         BOOST_CHECK_EQUAL(SipHashUint256(k1, k2, x), sip256.Finalize());
14         BOOST_CHECK_EQUAL(SipHashUint256Extra(k1, k2, x, n), sip288.Finalize());
15     }
16+
17+    HashWriter writer{};
18+    writer.write(std::vector<std::byte>{std::byte{'1'}, std::byte{'2'}, std::byte{'3'}});
19+    BOOST_CHECK_EQUAL(writer.GetSHA256(), uint256S("a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3"));
20 }
21 
22 BOOST_AUTO_TEST_SUITE_END()

The “a665a45.."-string was provided by 2 online SHA256 calculators hashing “123” as the code above.

Result:

0test/hash_tests.cpp(153): error: in "hash_tests/siphash": check writer.GetSHA256() == uint256S("a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3") has failed
1[e37aa2f7f7868e997ea01fff3f1f4aa0b84fdcef67487e419d2f422059a465a6 !=
2 a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3]

Conclusion: Byte order is unnecessarily reversed in this case, but now hard to change. Should probably be documented as “transaction id encoded in reverse-order hexadecimal” but this PR is just concerned with shortcomings of #30436.

in src/arith_uint256.h:29 in 07de45a882 outdated

25@@ -26,6 +26,7 @@ class base_uint
26 protected:
27     static_assert(BITS / 32 > 0 && BITS % 32 == 0, "Template parameter BITS must be a positive multiple of 32.");
28     static constexpr int WIDTH = BITS / 32;
29+    /** Fully little-endian byte ordering on little-endian architectures. */

ryanofsky commented at 10:18 pm on July 31, 2024:

In commit “doc + test: Correct uint256 hex string endianness” (07de45a882d43fd5567b2a6b7c8a2da3d6dbda3d)

Maybe just say number is stored in the array with least significant bits first and most significant bits last.

The current comment describing byte ordering on one specific architecture seems like an indirect way of describing the array when array elements aren’t bytes, and the class itself is architecture independent.

hodlinator commented at 1:18 pm on August 1, 2024:

Maybe just say number is stored in the array with least significant bits first and most significant bits last.

That is incorrect. The least significant byte is stored first for uint32_t on the little-endian archs. If one wants to avoid mentioning architecture, one could possibly say “the number is stored in an array with the least significant uint32_t chunk first”.

The current comment describing byte ordering on one specific architecture seems like an indirect way of describing the array when array elements aren’t bytes, and the class itself is architecture independent.

Do you mean mentioning architecture on a protected member instills fear that the users of the class will be exposed to architecture dependent behavior?

ryanofsky commented at 10:44 pm on August 1, 2024:

Maybe just say number is stored in the array with least significant bits first and most significant bits last.

That is incorrect. The least significant byte is stored first for uint32_t on the little-endian archs. If one wants to avoid mentioning architecture, one could possibly say “the number is stored in an array with the least significant uint32_t chunk first”.

It is correct, it’s just saying the highest bits of the number are in the last array element and the lowest bits of the number are in the first array element. But you could definitely say chunks instead of bits if you prefer that.

The current comment describing byte ordering on one specific architecture seems like an indirect way of describing the array when array elements aren’t bytes, and the class itself is architecture independent.

Do you mean mentioning architecture on a protected member instills fear that the users of the class will be exposed to architecture dependent behavior?

It’s just confusing and indirect.

hodlinator commented at 11:02 pm on August 2, 2024:

It is correct, it’s just saying the highest bits of the number are in the last array element and the lowest bits of the number are in the first array element. But you could definitely say chunks instead of bits if you prefer that.

It is correct if by “bits” you mean 32-bit chunks together I guess, then the lowest is at the beginning and the highest at the end. I was somehow wrongly interpreting you as meaning the highest bit (singular) at the end + the lowest bit at the beginning.

It’s just confusing and indirect.

Would this feel less confusing?

0    /** Number is stored as fully little-endian on little-endian archs. */
1    uint32_t pn[WIDTH];

..? Then it at least specifies what is being stored (but it’s the only data-member of base_uint so could be considered overly verbose IMO). Kept it short to stay within 80 char line.

sipa commented at 11:19 pm on August 2, 2024:

I would suggest just

0/** Number represented in least-significant-first base 2^32 digits. */

or so. The internal memory representation of the uint32_t values is entirely irrelevant as far as the language, or users’ understanding is concerned for the correctness of class arith_uint256.

hodlinator commented at 7:11 pm on August 3, 2024:

Thank you both for taking the time to feedback. My suggestions still feels more explanatory & direct to my mind. In light of that I’m leaning toward not adding that line at all in this PR. The main point of the PR was to correct my clearly incorrect and misleading comments that had already been merged after all.

ryanofsky commented at 1:25 am on August 5, 2024:

re: #30526 (review)

Thank you both for taking the time to feedback. My suggestions still feels more explanatory & direct to my mind. In light of that I’m leaning toward not adding that line at all in this PR. The main point of the PR was to correct my clearly incorrect and misleading comments that had already been merged after all.

I do think your comment is in improvement, since it documents the array order, just indirectly.

To describe the reasoning behind the suggestions with an analogy, imagine there’s a school field trip, the kids are taking buses, and you want to describe the order of the buses. You could say:

The kids in lower grades will be on the first buses, and the higher grades will be on the later buses (my version)
The students will get on the buses in lowest-to-highest grade order (sipa’s version)
The big endian representation of students grades as they get on the buses will increase lexicographically. (your version)

This analogy is a little rough but it should illustrate the point that there’s need to mention the internal representation of numbers that you are only using as numbers, not bytes, in code that is not using memcpy, memcmp, or similar functions.

I think sipa’s suggestion (“Number represented in least-significant-first base 2^32 digits.”) is good, but if it’s too mathematical, there are ways to make it sound less mathy like replacing “base 2^32 digits” with “uint32_t pieces” (or parts or chunks or smaller numbers).

in src/arith_uint256.h:223 in 07de45a882 outdated

219@@ -218,6 +220,8 @@ class base_uint
220     friend inline bool operator==(const base_uint& a, uint64_t b) { return a.EqualTo(b); }
221     friend inline bool operator!=(const base_uint& a, uint64_t b) { return !a.EqualTo(b); }
222 
223+    /** Hex encoding of the number (with the most significant characters first).

ryanofsky commented at 10:20 pm on July 31, 2024:

In commit “doc + test: Correct uint256 hex string endianness” (07de45a882d43fd5567b2a6b7c8a2da3d6dbda3d)

Maybe say digits instead of characters to be less generic. Also I think “most significant digits” is a more commonly understood phrase than “most significant characters”. Would suggest:

Returns representation of the number in hex (with most significant digits first)

l0rinc commented at 10:56 am on August 1, 2024:

👍, also nit: can be single-line, like the rest of these comments

hodlinator commented at 1:29 pm on August 1, 2024:

Fits within 80 chars line with “characters” -> “digits”. :+1:

in src/test/arith_uint256_tests.cpp:27 in 07de45a882 outdated

22@@ -22,7 +23,8 @@ static inline arith_uint256 arith_uint256V(const std::vector<unsigned char>& vch
23 {
24     return UintToArith256(uint256(vch));
25 }
26-static inline arith_uint256 arith_uint256S(const std::string& str) { return UintToArith256(uint256S(str)); }
27+// Takes a hex string with the most significant bytes first.
28+static inline arith_uint256 arith_uint256S(std::string_view str) { return UintToArith256(uint256S(str)); }

ryanofsky commented at 10:21 pm on July 31, 2024:

In commit “doc + test: Correct uint256 hex string endianness” (07de45a882d43fd5567b2a6b7c8a2da3d6dbda3d)

Would say digits instead of bytes here, for the same reason as suggesting digits instead of characters. Would suggest:

Takes a number written in hex (with most significant digits first)

in src/test/arith_uint256_tests.cpp:287 in 07de45a882 outdated

280@@ -278,6 +281,12 @@ BOOST_AUTO_TEST_CASE( comparison ) // <= >= < >
281         BOOST_CHECK( R1L <= TmpL ); BOOST_CHECK( (R1L == TmpL) != (R1L < TmpL)); BOOST_CHECK( (TmpL == R1L) || !( R1L >= TmpL));
282         BOOST_CHECK(! (TmpL < R1L)); BOOST_CHECK(! (R1L > TmpL));
283     }
284+
285+    BOOST_CHECK_LT(ZeroL,
286+                   OneL);
287+    // Verify hex string representation has the most significant bytes first.

ryanofsky commented at 10:25 pm on July 31, 2024:

In commit “doc + test: Correct uint256 hex string endianness” (07de45a882d43fd5567b2a6b7c8a2da3d6dbda3d)

Would say digits instead of bytes here to be less ambiguous. When there’s a hex encoding, “byte” could refer either to a byte of the hex string or to a byte of the data it represents, and I think you mean the former (because the digits in the two strings are what’s reversed, not the bytes). Would suggest:

Verify hex number representation has the most significant digits first.

hodlinator commented at 1:40 pm on August 1, 2024:

When there’s a hex encoding, “byte” could refer either to a byte of the hex string or to a byte of the data it represents, and I think you mean the former (because the digits in the two strings are what’s reversed, not the bytes).

I mean that FF represents a byte will all bits set, so I guess the latter. The individual hex digits are NOT reversed, 0F is not interpreted as F0, it is the order of the bytes that is reversed, 12EF => EF12.

Changed “string” -> “number”.

in src/test/arith_uint256_tests.cpp:289 in 07de45a882 outdated

280@@ -278,6 +281,12 @@ BOOST_AUTO_TEST_CASE( comparison ) // <= >= < >
281         BOOST_CHECK( R1L <= TmpL ); BOOST_CHECK( (R1L == TmpL) != (R1L < TmpL)); BOOST_CHECK( (TmpL == R1L) || !( R1L >= TmpL));
282         BOOST_CHECK(! (TmpL < R1L)); BOOST_CHECK(! (R1L > TmpL));
283     }
284+
285+    BOOST_CHECK_LT(ZeroL,
286+                   OneL);
287+    // Verify hex string representation has the most significant bytes first.
288+    BOOST_CHECK_LT(arith_uint256S("0000000000000000000000000000000000000000000000000000000000000001"),
289+                   arith_uint256S("1000000000000000000000000000000000000000000000000000000000000000"));

ryanofsky commented at 10:25 pm on July 31, 2024:

re: In commit “doc + test: Correct uint256 hex string endianness” (07de45a882d43fd5567b2a6b7c8a2da3d6dbda3d)

I think if you wanted to make the test a stronger demonstration of the fact that the first hex digit is the most significant digit, you would make the other digits in the smaller number as high as possible to show that they don’t effect the comparison. E.g.

0    BOOST_CHECK_LT(arith_uint256S("0FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"),
1                   arith_uint256S("1000000000000000000000000000000000000000000000000000000000000000"));

hodlinator commented at 1:42 pm on August 1, 2024:

Thanks for the suggestion but to me the current way makes it more clear.

in src/test/uint256_tests.cpp:164 in 07de45a882 outdated

162+    BOOST_CHECK_LT(*R2L.begin(), *R1L.begin());
163+    // Ensure last element comparisons give a different result (swapped params):
164+    BOOST_CHECK_LT(*(R1L.end()-1), *(R2L.end()-1));
165+    // Hex string representation is reversed, before lexicographic ordering:
166+    BOOST_CHECK_LT(uint256S("1000000000000000000000000000000000000000000000000000000000000000"),
167+                   uint256S("0000000000000000000000000000000000000000000000000000000000000001"));

ryanofsky commented at 10:27 pm on July 31, 2024:

In commit “doc + test: Correct uint256 hex string endianness” (07de45a882d43fd5567b2a6b7c8a2da3d6dbda3d)

I think if you wanted to make the test a stronger demonstration of the fact that last bytes in the hex strings determine the ordering, you would make the first bytes in the earlier hex string as high as possible to show that they don’t effect the comparison. E.g.

0    BOOST_CHECK_LT(uint256S("FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00"),
1                   uint256S("0000000000000000000000000000000000000000000000000000000000000001"));

l0rinc commented at 11:05 am on August 1, 2024:

Good observations

hodlinator commented at 7:05 pm on August 1, 2024:

Again, to me the current version feels clearer - like comparing an ant to an elephant. Your suggestion to me is like comparing an African elephant to an Indian one. One is probably visibly bigger, but you have to ask an extra time which one is which.

in src/uint256.h:75 in 07de45a882 outdated

72+    /** Unlike FromHex this accepts any invalid input, thus it is fragile and deprecated!
73+     *
74+     * Hex strings that don't specify enough bytes to fill the internal array
75+     * will be treated as setting the beginning of it.
76+     * Hex strings specifying too many bytes will have the beginning of the
77+     * string narrowed away.

ryanofsky commented at 10:31 pm on July 31, 2024:

In commit “doc + test: Correct uint256 hex string endianness” (07de45a882d43fd5567b2a6b7c8a2da3d6dbda3d)

I think this description is somewhat helpful but missing important information, like the fact that it is reversing the hex string.

Would suggest:

This function strips optional whitespace and “0x” prefixes and then copies pairs of hex digits from the end of the hex string, moving backwards, into the beginning bytes of the uint, moving forwards. If too many hex digits are provided, the digits at the beginning of the hex string are ignored. If too few hex digits are provided, the bytes at the end of the uint will be filled with 0’s. If there are too few hex digits, and the number of hex digits is odd, the high order bits of the last byte copied will be 0. For example if the hex string 0x123 is provided, the resulting blob will contain bytes {0x23, 0x1, 0x0, ..., 0x0}

hodlinator commented at 8:22 pm on August 1, 2024:

My goal with adding that comment to the deprecated function was to reason through part of how the “endianness” behavior made sense for off-length hex numbers because I found it remarkable. It wasn’t to describe everything the function does. I’ve attempted to clarify the existing comment with inspiration from https://github.com/bitcoin/bitcoin/pull/30526/files#r1699196174. Took your example about half-bytes.

in src/uint256.h:67 in 07de45a882 outdated

63     friend constexpr bool operator!=(const base_blob& a, const base_blob& b) { return a.Compare(b) != 0; }
64     friend constexpr bool operator<(const base_blob& a, const base_blob& b) { return a.Compare(b) < 0; }
65 
66-    // Hex string representations are little-endian.
67+    /* Hex string representations in reverse byte order of the internal array.
68+     * This is consistent with the internal array of arith_* being little-endian.

ryanofsky commented at 11:04 pm on July 31, 2024:

In commit “doc + test: Correct uint256 hex string endianness” (07de45a882d43fd5567b2a6b7c8a2da3d6dbda3d)

Statement “This is consistent with the internal array of arith_* being little-endian.” seems a bit vague to me. Also calling internal representation of the arithmetic class little endian might not be ideal because endianness normally refers to bytes, and the arithmetic class stores an array of uint32_t numbers, not of bytes.

I think I would suggest changing line to:

The reverse-byte hex representation is a convenient way to view the blob as a number, because it is consistent with the way the base_uint class converts blobs to numbers.

base_uint treats the blob as an array of bytes with the numerically least significant byte first and the most significant byte last. Because numbers are typically written with the most significant digit first and the least significant digit last, the reverse hex display of the blob corresponds to the same numeric value that base_uint interprets from the blob.

l0rinc commented at 11:01 am on August 1, 2024:

👍, I like that we’re not saying endianness for arrays anymore (as suggested by @sipa as well), but seems we still do here. Would be nice if we could eventually unify throughout the codebase.

hodlinator commented at 2:05 pm on August 1, 2024:

@ryanofsky Is that what they call copyrighting skills? Regardless I’ll take it. Very informative and precise. @paplorinc This was what made me still feel it was okay to use endianness terminology:

Quoting sipa #30526 (comment):

Its byte encoding can properly be described as little-endian.

He was more against applying endianness to to numbers, as it is a memory-order concept for bytes. But I agree avoiding the concept here because of the uint32_t involved is wise.

ryanofsky approved

ryanofsky commented at 11:14 pm on July 31, 2024: contributor

Code review ACK 07de45a882d43fd5567b2a6b7c8a2da3d6dbda3d. I do think this is an improvement over the status quo, but I think a lot of things were less clear than they could be so I left a lot of suggestions.

in src/uint256.h:72 in 07de45a882 outdated

69+     */
70     std::string GetHex() const;
71-    /** Unlike FromHex this accepts any invalid input, thus it is fragile and deprecated */
72+    /** Unlike FromHex this accepts any invalid input, thus it is fragile and deprecated!
73+     *
74+     * Hex strings that don't specify enough bytes to fill the internal array

l0rinc commented at 11:02 am on August 1, 2024:

nit: the hex strings don’t actually provide bytes, the bytes are computed, right?

hodlinator commented at 2:00 pm on August 1, 2024:

They specify which bytes are to be computed? :) Maybe there’s some other way of making it clearer.

in src/uint256.h:125 in 07de45a882 outdated

104@@ -93,7 +105,7 @@ class base_blob
105 
106 namespace detail {
107 /**
108- * Writes the hex string (treated as little-endian) into a new uintN_t object
109+ * Writes the hex string (in reverse byte order) into a new uintN_t object

l0rinc commented at 11:03 am on August 1, 2024:

👍 I like this wording a lot more

in src/uint256.h:57 in 07de45a882 outdated

52@@ -53,15 +53,27 @@ class base_blob
53         std::fill(m_data.begin(), m_data.end(), 0);
54     }
55 
56+    /** Lexicographic ordering
57+     * @note Does NOT match the ordering on the corresponding \ref

l0rinc commented at 11:04 am on August 1, 2024:

this is really messy, I wonder if we have a chance to untangle any of this - or whether documenting it is our only possible solution

in src/arith_uint256.h:199 in 07de45a882 outdated

196@@ -196,6 +197,7 @@ class base_uint
197         return ret;
198     }
199 
200+    /** Numeric ordering (unlike \ref base_blob::Compare) */

l0rinc commented at 11:07 am on August 1, 2024:

no strong preference either way, just noticing that in other cases it’s rather called https://en.wikipedia.org/wiki/Natural_sort_order

hodlinator commented at 1:26 pm on August 1, 2024:

Reading up on that page makes me think it applies mostly to numbers of different widths within the context of alphabetical ordering of strings, but base_uints are arguably zero-padded to the same width and no alphabetical ordering of strings is taking place. This seems just as appropriate: https://en.wikipedia.org/wiki/Collation#Numerical_and_chronological

(Wording based on sipa’s #30526 (comment):

Its comparison is numeric comparison.

l0rinc approved

l0rinc commented at 11:09 am on August 1, 2024: contributor

utACK 07de45a882d43fd5567b2a6b7c8a2da3d6dbda3d

I love that we’re striving to making the campground cleaner than we found it, we need more of these kind of changes <3 Agree with @ryanofsky, some of the comments could still use some clarification, but I’m fine with it as-is as well.

hodlinator force-pushed on Aug 1, 2024

DrahtBot added the label CI failed on Aug 1, 2024

DrahtBot commented at 9:20 pm on August 1, 2024: contributor

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/28237349796

Make sure to run all tests locally, according to the documentation.

The failure may happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

hodlinator commented at 9:21 pm on August 1, 2024: contributor

Latest force push to fix CI failure: https://github.com/bitcoin/bitcoin/pull/30526/checks?check_run_id=28237349796

in src/uint256.h:95 in 0828a3601c outdated

92+     *   byte being zero.
93+     *   "0x123" => {0x23, 0x1, 0x0, ..., 0x0}
94+     */
95     void SetHexDeprecated(std::string_view str);
96     std::string ToString() const;
97+    /**@}*/

l0rinc commented at 9:31 pm on August 1, 2024:

was this deliberate? if so, what does it mean?

hodlinator commented at 10:14 pm on August 1, 2024:

https://www.doxygen.nl/manual/grouping.html#memgroup

Check out make -j<cores> docs.

(Inspired by https://github.com/bitcoin/bitcoin/pull/16545/files#diff-319586d074603b695c1715b55a85034a46e7fa74a4fc8ce9d988015db40d64c3R277).

l0rinc commented at 5:59 am on August 2, 2024:

hah, haven’t seen that yet, thanks

DrahtBot removed the label CI failed on Aug 1, 2024

ryanofsky approved

ryanofsky commented at 10:55 pm on August 1, 2024: contributor

Code review ACK 0828a3601c7f323f815598a41dd7854ef25265a7. Thanks for considering the suggestions, and I like the way you now grouped the reverse-hex functions together and added an overall description of the representation.

I do think it would be better if the base_uint class try to didn’t try to document the array by describing its memory layout on little endian machines. I think it would be clearer if it just said what order the array elements are in.

DrahtBot requested review from l0rinc on Aug 1, 2024

in src/test/arith_uint256_tests.cpp:287 in 0828a3601c outdated

280@@ -278,6 +281,12 @@ BOOST_AUTO_TEST_CASE( comparison ) // <= >= < >
281         BOOST_CHECK( R1L <= TmpL ); BOOST_CHECK( (R1L == TmpL) != (R1L < TmpL)); BOOST_CHECK( (TmpL == R1L) || !( R1L >= TmpL));
282         BOOST_CHECK(! (TmpL < R1L)); BOOST_CHECK(! (R1L > TmpL));
283     }
284+
285+    BOOST_CHECK_LT(ZeroL,
286+                   OneL);
287+    // Verify hex number representation has the most significant bytes first.

ryanofsky commented at 11:20 pm on August 1, 2024:

re: #30526 (review)

Still think it would be better to say digits instead of bytes here because this is referring to a number string, not a byte array.

re: #30526 (review)

When there’s a hex encoding, “byte” could refer either to a byte of the hex string or to a byte of the data it represents, and I think you mean the former (because the digits in the two strings are what’s reversed, not the bytes).

I mean that FF represents a byte will all bits set, so I guess the latter. The individual hex digits are NOT reversed, 0F is not interpreted as F0, it is the order of the bytes that is reversed, 12EF => EF12.

I don’t understand why you are thinking about bytes at all here when this is a number string. When I am saying the digits are reversed I am saying “12345” is becoming “54321”. I am not saying “12345” is becoming “103254”. In your test you making sure the number “0000000000000000000000000000000000000000000000000000000000000001” is less than the number “1000000000000000000000000000000000000000000000000000000000000000”. The digits in these numbers are reversed, not the bytes. The second number is straightforwardly higher than the first because the highest digit is bigger (1 > 0). It is true you can also say that the highest byte is bigger (10 > 0), but it’s an odd thing to say because this is just a number and there should be no need to think about bytes at all to make sure a numeric comparison is working correctly.

hodlinator commented at 7:47 pm on August 3, 2024:

When I am saying the digits are reversed I am saying “12345” is becoming “54321”. I am not saying “12345” is becoming “103254”

My main point was that it would rather be “012345” -> “452301” if interpreted as hex, rather than any of the others.

because the digits in the two strings are what’s reversed, not the bytes

I got hung up on that because to me digits are the individual UTF-8 chars.

Verify hex number representation has the most significant digits first

I am trying to cater to programmers over mathematicians, but I think that’s an acceptable suggestion in the arith_uint256S case after thinking about it some more, thanks.

Sorry for my contribution to derailing the conversation.

ryanofsky commented at 1:23 am on August 5, 2024:

re: #30526 (review)

because the digits in the two strings are what’s reversed, not the bytes

I got hung up on that because to me digits are the individual UTF-8 chars.

Right, by “bytes” you meant bytes of the number encoded of as a string, not bytes of the number in memory. It’s just good to avoid the word “bytes” to avoid that ambiguity.

I am trying to cater to programmers over mathematicians, but I think that’s an acceptable suggestion in the arith_uint256S case after thinking about it some more, thanks.

Sorry for my contribution to derailing the conversation.

No problem, I think it’s good to have these exchanges if they clarify our thinking and improve the final result.

Mahmoud198425 approved

l0rinc commented at 6:00 am on August 2, 2024: contributor

ACK 0828a3601c7f323f815598a41dd7854ef25265a7

doc + test: Correct uint256 hex string endianness

Follow-up to #30436.

uint256 string representation was wrongfully documented as little-endian due to them being reversed by GetHex() etc, and base_blob::Compare() giving most significance to the beginning of the internal array. They are closer to "big-endian", but this commit tries to be even more precise than that.

uint256_tests.cpp - Avoid using variable from the left side of the condition in the right side.

setup_common.cpp - Skip needless ArithToUint256-conversion.

73e3fa10b4

hodlinator force-pushed on Aug 3, 2024

l0rinc approved

l0rinc commented at 8:11 pm on August 3, 2024: contributor

ACK 73e3fa10b4dd63b7767d6b6f270df66971067341

DrahtBot requested review from ryanofsky on Aug 3, 2024

ryanofsky approved

ryanofsky commented at 1:19 am on August 5, 2024: contributor

Code review ACK 73e3fa10b4dd63b7767d6b6f270df66971067341

Just tweaked code comments since last review. I think this can be merged even though from the review comments it seems like there could be more more followups later. Everything here now seems like an improvement over the status quo.

in src/uint256.h:68 in 73e3fa10b4

64     friend constexpr bool operator<(const base_blob& a, const base_blob& b) { return a.Compare(b) < 0; }
65 
66-    // Hex string representations are little-endian.
67+    /** @name Hex representation
68+     *
69+     * The reverse-byte hex representation is a convenient way to view the blob

ryanofsky commented at 2:22 am on August 5, 2024:

In commit “doc + test: Correct uint256 hex string endianness” (73e3fa10b4dd63b7767d6b6f270df66971067341)

I think this documentation could use an introduction that gives some context. Would change to:

The hex representation used by GetHex(), ToString(), FromHex() and ParseHexDeprecated() is unconventional, as it shows bytes in reverse order. For instance, a 32-bit blob {0x12, 0x34, 0x56, 0x78} is represented as "78563412" instead of the more typical "12345678" format used by tools like Unix’s xxd or Python’s bytes.hex() and bytes.fromhex() functions to represent the same binary data.

The reverse-byte hex representation of the blob is equivalent to treating the blob as a little-endian number, and then displaying that number in base 16. So it is consistent with the way the base_uint class converts blobs to numbers, and could be a convenient way to view blobs that are numbers.

ryanofsky merged this on Aug 5, 2024

ryanofsky closed this on Aug 5, 2024

ryanofsky commented at 3:20 am on August 5, 2024: contributor

Merged with only 2 acks, since this is a documentation & test change.

It does seem like there are some possible followups here:

Providing more context for reverse-byte hex representation #30526 (review)
Improving RPC documentation as suggested by paplorinc in #30526 (comment)
Documenting base_uint::pn array #30526 (review)
Maybe adding stricter tests for comparison functions #30526 (review) and #30526 (review)

re: #30526 (comment)

While I agree that your proposed approach could lead to more precise terminology here, upon digging into the codebase, it appears that endianness terms are used more broadly than your strict interpretation would suggest:

Endianness terminology used for hexadecimal representations:

Transaction IDs: https://github.com/bitcoin/bitcoin/blob/master/src/rpc/mining.cpp#L620

Hashes: https://github.com/bitcoin/bitcoin/blob/master/src/rpc/mining.cpp#L621

These are good finds and I think hodlinator’s suggestion to document these as “transaction id encoded in reverse-order hexadecimal” could an improvement, although maybe it is ambiguous. “Reverse-order hexadecimal” sounds like the hexadecimal string might be reversed, rather than the bytes represented by the hexadecimal string being reversed. I might suggest changing documentation of these fields to:

txid hash with bytes reversed, encoded in hex
wtxid hash with bytes reversed, encoded in hex

but maybe something else would be better.

Endianness applied to non-numeric data:

Base58 encoding: https://github.com/bitcoin/bitcoin/blob/5d280130446d57d653c749005a2e363265d87686/src/base58.cpp#L98

I think this comment is using “big-endian” correctly, though maybe not very clearly. The idea behind base58 encoding is that you take a byte array, interpret the byte array as a big-endian number, and display that number in base-58 (with a special prefix of 1’s if the original byte array began with 0’s). (The python version of this function is much easier to understand, for reference.)

CRC32: https://github.com/bitcoin/bitcoin/blob/5d280130446d57d653c749005a2e363265d87686/src/crc32c/src/crc32c_read_le.h#L17

This seems like a totally correct comment. This function is taking bytes and converting them to an integer. Seems like a straightforward case of using “endianness” to describe byte order of numeric data.

SipHash: https://github.com/bitcoin/bitcoin/blob/5d280130446d57d653c749005a2e363265d87686/src/crypto/siphash.h#L25

This also seems correct. The siphash class was two Write methods. There is a Write method take bytes as input, and another Write method that takes 64 bit numbers as input. The documentation is just saying if you give that one a number, the little endian representation of the number is what is hashed.

uint256 treated as having specific endianness:

In policy code: https://github.com/bitcoin/bitcoin/blob/5d280130446d57d653c749005a2e363265d87686/src/policy/packages.h#L93

This seems like a borderline case and probably would be good to clarify. It’s just describing the way the wtxids are sorted before they are hashed. Instead of the implementation sorting the wtxid so they are in lexigraphic order, it sorts them so that the reversed bytes of the wtxid’s would be in lexicographic order. This is equivalent to treating the wtxid bytes as little endian numbers and sorting the numbers.

It also doesn’t help that the uint256 getters (and testers like ArrayToString) themselves reverse the byte order before returning it (https://github.com/bitcoin/bitcoin/blob/5d280130446d57d653c749005a2e363265d87686/src/uint256.cpp#L11-L18)

This isn’t great but I think it’s baked in and the best thing we can do is document it.

and that reading uint64 chunks even considers the native endianness https://github.com/bitcoin/bitcoin/blob/5d280130446d57d653c749005a2e363265d87686/src/uint256.h#L79.

This should be fine, because it is only using native endianness as an optimization, and does the same thing on big and little endian systems. It is just interpreting bytes of the blob as a little endian number, and on little endian systems this can be done efficiently with memcpy.

The problem doesn’t seem to be as simple as just changing the terminology, it seems to be deeply ingrained in the codebase. I assume that we can’t actually simplify the code in most places (maybe for SipHash-type usages where the result is local and temporary), so we have to either refactor or document, right? How do you suggest we tackle this?

I think we can clean up documentation various places, name functions better, and maybe in the long term move towards sipa’ suggestion to replace these classes with using uint256 = std::array<std::byte, 32>.

hodlinator deleted the branch on Aug 5, 2024

Theghost256 approved

doc: Correct uint256 hex string endianness #30526

Logical reasoning for endianness

String conversions

How I got it wrong in #30436

Code Coverage

Reviews

Conflicts