Follow-ups to Bech32 error detection #23577

meshcollider commented at 4:07 AM on November 23, 2021: contributor

A number of follow-ups and improvements to the bech32 error location code, introduced in #16807.

Notably, this removes the hardcoded GF1024 tables in favour of constexpr table generation.

Address review comments for Bech32 error validation bb4d3e9b97

Improve Bech32 boost tests 92f0cafdca

Report encoding type in bech32 error message c8b9a224e7

meshcollider added the label RPC/REST/ZMQ on Nov 23, 2021

meshcollider added the label Wallet on Nov 23, 2021

meshcollider commented at 4:08 AM on November 23, 2021: contributor

@ryanofsky and @sipa, this addresses most of your review comments in #16807 so you may like to take a look, thanks!

laanwj commented at 10:15 AM on November 24, 2021: member

Notably, this removes the hardcoded GF1024 tables in favour of constexpr table generation.

Wow, this makes the compiler generate the table? Agree this is better for understandability, just wonder, does this have any noticeable impact on compilation time?

meshcollider commented at 9:26 PM on November 24, 2021: contributor

@laanwj Wow, this makes the compiler generate the table? Agree this is better for understandability, just wonder, does this have any noticeable impact on compilation time?

It shouldn't do, each element of the table only takes a few bitwise operations on 16 bit integers and the table is only 1024 elements.

EDIT: a quick test compiling that file using clang takes 0.24 on master and 0.28 with this PR so it is pretty negligible overall.

MarcoFalke added this to the milestone 23.0 on Nov 25, 2021

laanwj commented at 1:42 PM on November 29, 2021: member

I checked the assembly output from clang 13.0.0 and gcc 9.3.0 that the tables are indeed a) generated constants in .rodata and b) different from before (as expected and mentioned in the commit description) c) match between compilers.

	.type	_ZN6bech3212_GLOBAL__N_110GF1024_LOGE,@object # @_ZN6bech3212_GLOBAL__N_110GF1024_LOGE
	.section	.rodata,"a",@progbits
	.p2align	1
_ZN6bech3212_GLOBAL__N_110GF1024_LOGE:
	.short	65535                           # 0xffff
	.short	0                               # 0x0
	.short	99                              # 0x63
	.short	363                             # 0x16b
	.short	198                             # 0xc6
⋮
	.type	_ZN6bech3212_GLOBAL__N_110GF1024_EXPE,@object # @_ZN6bech3212_GLOBAL__N_110GF1024_EXPE
	.p2align	1
_ZN6bech3212_GLOBAL__N_110GF1024_EXPE:
	.short	1                               # 0x1
	.short	32                              # 0x20
	.short	311                             # 0x137
	.short	139                             # 0x8b
	.short	206                             # 0xce
⋮

Full values for both tables can be found here: https://0bin.net/paste/Lp6hKmVD#EwTuc1KFIDzy57g0g3EvtQiLIj3HWD4e+OGPWUJTwQ+ (I did not have anything to check them against).

MarcoFalke commented at 1:46 PM on November 29, 2021: member

(I did not have anything to check them against).

Looks like the GF1024_EXP you posted differs from the one that is removed in this pull?

sipa commented at 1:55 PM on November 29, 2021: member

I will review in detail shortly, but I suspect that @meshcollider perhaps included the table changes corresponding to https://github.com/sipa/bech32/pull/64 here? There is Sage code there that generates the tables too, if someone wants something to compare against.

laanwj commented at 1:56 PM on November 29, 2021: member

Looks like the GF1024_EXP you posted differs from the one that is removed in this pull?

Both tables are. From the commit:

Note that the tables generated by this code are different to the previous hardcoded tables, because we simplify to encoding (e) as 1 || 0 rather than 9 || 15 (as was done in PR 64 of the bech32 repo).

I began this test with the intent to compare if they were the same, then noticed they were not. But it's expected (they use a different encoding).

Edit: I'll check against @sipa's sage output. Edit2: All values match. The only difference is that the size of GF1024_EXP printed by the Sage script is 1024, while the structure here is 1023 long.

meshcollider commented at 8:52 PM on November 29, 2021: contributor

I'm happy to split that commit into two, the first deriving the existing table and then the second modifying to the new table, if that would help review?

The length difference (1024 vs 1023) is just because the sage code also includes (a)^1023 = 1 = (a)^0.

in src/bech32.cpp:38 in 9a505c1acf outdated

 209 | -    452, 710, 552, 128, 612, 600, 275, 322, 193
 210 | -};
 211 | -static_assert(std::size(GF1024_LOG) == 1024, "GF1024_EXP length should be 1024");
 212 | +/** We work with the finite field GF(1024) defined as a degree 2 extension of the base field GF(32)
 213 | + * The defining polynomial of the extension is x^2 + 9x + 23
 214 | + * Let (e) be a primitive element of GF(1024), that is, a generator of the field.

sipa commented at 9:10 PM on November 29, 2021:

This may be a bit of a philosophical point, touching on difference between specific field and field definition up to isomorphism.

Still, I don't think it's sufficient (or at least confusing) to say it is a primitive element of GF(1024). It is specifically one of the roots of x^2 + 9x + 23. Things work out well because this is a primitive polynomial, and thus its root in the extension field is indeed a generator, but it's perfectly possible to pick a non-primitive but irreducible polynomial to define the extension field, and then pick a primitive element of that field (which then wouldn't be a root of its modulus).

sipa commented at 9:16 PM on November 29, 2021:

How about this?

Define F=GF(32) as the set of polynomials over GF(2) modulo x^5 + x^3 + 1. We will represent elements of F as their integer encoding (e.g. 11 to mean x^3 + x + 1). Define E=GF(1024) as an extension of F, namely the set of polynomials over F modulo x^2+ 9x + 23. We will represent the element (ax + b) of F as the integer 32*a+b. Let e be the primitive element x of F (represented as 32). Every non-zero element of the field can then ...

meshcollider commented at 9:20 PM on November 29, 2021:

This is the definition of a primitive element. From Wikipedia:

In field theory, a primitive element of a finite field GF(q) is a generator of the multiplicative group of the field. In other words, α ∈ GF(q) is called a primitive element if it is a primitive (q − 1)th root of unity in GF(q); this means that each non-zero element of GF(q) can be written as α^i for some integer i.

But I agree this is more than just any primitive element, it is specifically because it is a root of the defining polynomial that we decide to use it. I will amend the comments.

in src/bech32.cpp:38 in 98124a6394 outdated

 209 | -    575, 992, 463, 983, 243, 360, 970, 350, 267, 615, 766, 494, 31, 1009,
 210 | -    452, 710, 552, 128, 612, 600, 275, 322, 193
 211 | -};
 212 | -static_assert(std::size(GF1024_LOG) == 1024, "GF1024_EXP length should be 1024");
 213 | +/** We work with the finite field GF(1024) defined as a degree 2 extension of the base field GF(32)
 214 | + * The defining polynomial of the extension is x^2 + 9x + 23.

sipa commented at 9:31 PM on November 29, 2021:

Thanks, though I still have a nit: you haven't defined an encoding for GF(32) elements, so the number 9 and 23 have no meaning.

in doc/release-notes-16807.md:6 in 98124a6394 outdated

   5 | -invalid characters in the address. For example, this will return the locations of up to two Bech32
   6 | -errors.
   7 | \ No newline at end of file
   8 | +- The `validateaddress` RPC now returns an `error_locations` array for invalid
   9 | +addresses, with the indices of invalid character locations in the address (if
  10 | +known). For example, this will return the locations of up to two Bech32 errors.

sipa commented at 10:29 PM on November 29, 2021:

Nit: there is really no way to know these are the actual error locations. They're a guess, except under the assumption that no more than 2 substitution errors have been made, in which case they're guaranteed to be correct.

in src/bech32.cpp:46 in 98124a6394 outdated

 217 | + * as (e)^k for some power k.
 218 | + * The array GF1024_EXP contains all these powers of (e) - GF1024_EXP[k] = (e)^k in GF(1024).
 219 | + * Conversely, GF1024_LOG contains the discrete logarithms of these powers, so
 220 | + * GF1024_LOG[GF1024_EXP[k]] == k.
 221 | + * The following function generates the two tables GF1024_EXP and GF1024_LOG as constexprs. */
 222 | +constexpr std::pair<std::array<int16_t, 1023>, std::array<int16_t, 1024>> generate_gf_tables()

sipa commented at 10:32 PM on November 29, 2021:

Nit: the generate functions here don't really match our coding style.

sipa commented at 10:48 PM on November 29, 2021: member

ACK 98124a63946d6f7d665864adbef35a863423ca47, with or without nits.

Update release note for bech32 error detection 63f7b69779

in src/bech32.cpp:102 in 98124a6394 outdated

 273 | +
 274 | +    return std::make_pair(GF1024_EXP, GF1024_LOG);
 275 | +}
 276 | +
 277 | +constexpr auto tables = generate_gf_tables();
 278 | +constexpr std::array<int16_t, 1023> GF1024_EXP = tables.first;

sipa commented at 11:06 PM on November 29, 2021:

I assume this doesn't actually materialize the tables twice, as tables would be an unused symbol in an anonymous namespace. Still, if you'd want to avoid the copying at compile-time (or rule out the risk that it gets emitted twice in the binary), these GF1024_EXP and GF1024_LOG variables could be references (e.g. constexpr const std::array<int16_t, 1023>&).

meshcollider commented at 12:41 AM on November 30, 2021: contributor

Updated to address all of @sipa's nits, and split the GF1024 constexpr commit into two:

the first keeps the table the same, and generates it at compile time.
the second modifies the generation to use the simpler encoding from sipa/bech32#64

in src/bech32.cpp:119 in 1e24b1d385 outdated

 290 | +    return std::make_pair(GF1024_EXP, GF1024_LOG);
 291 | +}
 292 | +
 293 | +constexpr auto tables = GenerateGFTables();
 294 | +constexpr const std::array<int16_t, 1023>& GF1024_EXP = tables.first;
 295 | +constexpr const std::array<int16_t, 1024>& GF1024_LOG = tables.second;

laanwj commented at 3:19 PM on November 30, 2021:

After the recent push I cannot find GF1024_EXP and GF1024_LOG as symbols anymore in the compiled output. I think that makes sense as they're references to tables now, which is fine, just need to adapt my comparison script.

laanwj commented at 3:43 PM on November 30, 2021: member

Thanks, re-checked:

gcc and clang output for tables of 1e24b1d385d64892389985afc058c83410c44311 matches the output of the sage script
gcc and clang output for tables of 83dc0b6392a99ea99e2d67d6893d715e48340023 matches GF1024_EXP + GF1024_LOG of the commit before it

Code review ACK 1e24b1d385d64892389985afc058c83410c44311

gruve-p commented at 3:51 PM on November 30, 2021: contributor

ACK https://github.com/bitcoin/bitcoin/pull/23577/commits/1e24b1d385d64892389985afc058c83410c44311

in src/bech32.cpp:255 in 1e24b1d385 outdated

 251 | +            SYNDROME_CONSTS[ind] = c;
 252 | +        }
 253 | +    }
 254 | +    return SYNDROME_CONSTS;
 255 | +}
 256 | +constexpr const std::array<uint32_t, 25>& SYNDROME_CONSTS = GenerateSyndromeConstants();

sipa commented at 3:58 PM on November 30, 2021:

Using a const reference here is counterproductive, it seems: the result appears to be that the reference is constexpr, but the value it is bound to isn't because it is a temporary, so the actual SYNDROME_CONSTS values aren't constexpr (I noticed when trying to add static_assertions on them).

meshcollider commented at 5:14 PM on November 30, 2021:

Ah okay, will revert this change then.

sipa commented at 3:59 PM on November 30, 2021: member

Verified the generated tables by adding the assertions generated by this Sage code:

# Binary field
B.<b> = GF(2)
# Polynomials over the binary field
BP.<bp> = B[]
# Encoder field GF(32)
F.<f> = GF(32, modulus=bp^5 + bp^3 + 1, repr='int')
# Polynomials over the encoder field
FP.<fp> = F[]
# Decoder field GF(1024)
E.<e> = F.extension(modulus=fp^2 + F.fetch_int(9)*fp + F.fetch_int(23))

# Convert an E field element to an integer 0..1023.
def gf_to_int(v):
    return v[0].integer_representation() + 32*v[1].integer_representation()

# Convert an integer 0..1023 to an E field element.
def int_to_gf(v):
    return F.fetch_int(v & 31) + e*F.fetch_int(v >> 5)

GF1024_EXP = [gf_to_int(e**i) for i in range(1023)]
GF1024_LOG = [-1] + [discrete_log(int_to_gf(i), e, ord=1023, operation="*") for i in range(1, 1024)]

for i in range(1023):
    print("static_assert(GF1024_EXP[%i] == %i);" % (i, GF1024_EXP[i]))
for i in range(1024):
    print("static_assert(GF1024_LOG[%i] == %i);" % (i, GF1024_LOG[i]))

SYNDROME_CONSTS = [None for _ in range(25)]
a1, a2, a3 = e**997, e**998, e**999
for k in range(1, 6):
    for b in range(5):
        c = (gf_to_int(a1**k * int_to_gf(2)**b) |
             gf_to_int(a2**k * int_to_gf(2)**b) << 10 |
             gf_to_int(a3**k * int_to_gf(2)**b) << 20)
        SYNDROME_CONSTS[(k-1)*5 + b] = c

for i in range(25):
    print("static_assert(SYNDROME_CONSTS[%i] == %i);" % (i, SYNDROME_CONSTS[i]))

Replace GF1024 tables and syndrome constants with compile-time generated constexprs. 14358a029d

Simplify encoding of e in GF(1024) tables to (1,0)

This follows PR 64 of the sipa/bech32 repo.

28d9c2857f

Use bounds-checked array lookups in Bech32 error detection code 405c96fc9f

Use std::iota instead of manually pushing range 2fa4fd1961

meshcollider commented at 10:03 PM on November 30, 2021: contributor

Only change:

- constexpr const std::array<uint32_t, 25>& SYNDROME_CONSTS = GenerateSyndromeConstants();
+ constexpr std::array<uint32_t, 25> SYNDROME_CONSTS = GenerateSyndromeConstants();

as per @sipa's comment.

sipa commented at 11:12 PM on November 30, 2021: member

ACK 2fa4fd196176160a5ad0a25da173ff93252b8103, verified the tables by inserting assertions generated by a Sage script.

in src/bech32.cpp:400 in 2fa4fd1961 outdated

 397 | @@ -404,7 +398,8 @@ DecodeResult Decode(const std::string& str) {
 398 |  /** Find index of an incorrect character in a Bech32 string. */
 399 |  std::string LocateErrors(const std::string& str, std::vector<int>& error_locations) {
 400 |      if (str.size() > 90) {

ryanofsky commented at 9:44 PM on December 3, 2021:

In commit "Use std::iota instead of manually pushing range" (2fa4fd196176160a5ad0a25da173ff93252b8103)

It would be good to add error_locations.clear() to the top of this function, if the intent is now to overwrite error_locations instead of appending to it. Before this commit, the function consistently appended without overwriting, and now it sometimes overwrites, sometimes appends, which is less consistent.

meshcollider commented at 11:50 PM on December 3, 2021:

Done in new commit to avoid invalidating the two ACKs.

laanwj commented at 8:18 AM on December 5, 2021:

Sorry to nit the nit, if the intent is to always return something new instead of modifying in place, why not return both values in the return value with e.g. std::pair or std::tuple instead? Personally at least I really prefer using the return value to return values instead of output arguments with references, it conveys the intent better.

meshcollider commented at 1:27 AM on December 6, 2021:

@laanwj no worries, I agree. Addressed your nit-nit :)

ryanofsky approved

ryanofsky commented at 9:49 PM on December 3, 2021: member

Code review ACK 2fa4fd196176160a5ad0a25da173ff93252b8103 with caveat that I didn't try to understand or verify the math here, I just checked that all comments describing the math matched up with the code, and that all the changes seemed like improvements.

gruve-p commented at 2:07 PM on December 5, 2021: contributor

ACK https://github.com/bitcoin/bitcoin/pull/23577/commits/353b46368069531e51ba6f8fafc41259968b965a

Make Bech32 LocateErrors return error list rather than using out-arg a4fe70171b

laanwj commented at 9:41 AM on December 6, 2021: member

Re-ACK a4fe70171b6fa570eda71d86b59d0fb24c2f0614

laanwj merged this on Dec 6, 2021

laanwj closed this on Dec 6, 2021

meshcollider deleted the branch on Dec 6, 2021

in src/bech32.cpp:399 in a4fe70171b

 395 | @@ -396,23 +396,28 @@ DecodeResult Decode(const std::string& str) {
 396 |  }
 397 |  
 398 |  /** Find index of an incorrect character in a Bech32 string. */
 399 | -std::string LocateErrors(const std::string& str, std::vector<int>& error_locations) {
 400 | +std::pair<std::string, std::vector<int>> LocateErrors(const std::string& str) {

MarcoFalke commented at 1:09 PM on December 6, 2021:

The docstring reads odd in doxygen:

"Find index of an incorrect character in a Bech32 string. Return the positions of errors in a Bech32 string. "

See https://doxygen.bitcoincore.org/namespacebech32.html#ad3d7336317a6e6d359b54b2f60b53acc

Also, std::make_pair is not needed with C++11/17

diff --git a/src/bech32.cpp b/src/bech32.cpp
index 3cda1dfff5..abc617467d 100644
--- a/src/bech32.cpp
+++ b/src/bech32.cpp
@@ -111,7 +111,7 @@ constexpr std::pair<std::array<int16_t, 1023>, std::array<int16_t, 1024>> Genera
         GF1024_LOG[v] = i;
     }
 
-    return std::make_pair(GF1024_EXP, GF1024_LOG);
+    return {GF1024_EXP, GF1024_LOG};
 }
 
 constexpr auto tables = GenerateGFTables();
@@ -395,27 +395,27 @@ DecodeResult Decode(const std::string& str) {
     return {result, std::move(hrp), data(values.begin(), values.end() - 6)};
 }
 
-/** Find index of an incorrect character in a Bech32 string. */
-std::pair<std::string, std::vector<int>> LocateErrors(const std::string& str) {
+std::pair<std::string, std::vector<int>> LocateErrors(const std::string& str)
+{
     std::vector<int> error_locations{};
 
     if (str.size() > 90) {
         error_locations.resize(str.size() - 90);
         std::iota(error_locations.begin(), error_locations.end(), 90);
-        return std::make_pair("Bech32 string too long", std::move(error_locations));
+        return {"Bech32 string too long", std::move(error_locations)};
     }
 
     if (!CheckCharacters(str, error_locations)){
-        return std::make_pair("Invalid character or mixed case", std::move(error_locations));
+        return {"Invalid character or mixed case", std::move(error_locations)};
     }
 
     size_t pos = str.rfind('1');
     if (pos == str.npos) {
-        return std::make_pair("Missing separator", std::vector<int>{});
+        return {"Missing separator", std::vector<int>{}};
     }
     if (pos == 0 || pos + 7 > str.size()) {
         error_locations.push_back(pos);
-        return std::make_pair("Invalid separator position", std::move(error_locations));
+        return {"Invalid separator position", std::move(error_locations)};
     }
 
     std::string hrp;
@@ -430,7 +430,7 @@ std::pair<std::string, std::vector<int>> LocateErrors(const std::string& str) {
         int8_t rev = CHARSET_REV[c];
         if (rev == -1) {
             error_locations.push_back(i);
-            return std::make_pair("Invalid Base 32 character", std::move(error_locations));
+            return {"Invalid Base 32 character", std::move(error_locations)};
         }
         values[i - pos - 1] = rev;
     }
@@ -550,7 +550,7 @@ std::pair<std::string, std::vector<int>> LocateErrors(const std::string& str) {
             }
         } else {
             // No errors
-            return std::make_pair("", std::vector<int>{});
+            return {"", std::vector<int>{}};
         }
 
         if (error_locations.empty() || (!possible_errors.empty() && possible_errors.size() < error_locations.size())) {
@@ -562,7 +562,7 @@ std::pair<std::string, std::vector<int>> LocateErrors(const std::string& str) {
                               : error_encoding == Encoding::BECH32 ? "Invalid Bech32 checksum"
                               : "Invalid checksum";
 
-    return std::make_pair(error_message, std::move(error_locations));
+    return {error_message, std::move(error_locations)};
 }
 
 } // namespace bech32
diff --git a/src/bech32.h b/src/bech32.h
index 5e89e6efda..707299f19e 100644
--- a/src/bech32.h
+++ b/src/bech32.h
@@ -45,7 +45,7 @@ struct DecodeResult
 /** Decode a Bech32 or Bech32m string. */
 DecodeResult Decode(const std::string& str);
 
-/** Return the positions of errors in a Bech32 string. */
+/** Return the positions of errors in a Bech32 string. (Empty string if no errors can be found). */
 std::pair<std::string, std::vector<int>> LocateErrors(const std::string& str);
 
 } // namespace bech32

MarcoFalke commented at 1:09 PM on December 6, 2021: member

left a nit

MarcoFalke commented at 1:56 PM on December 7, 2021: member

More nits:

There are two lines missing test coverage: https://marcofalke.github.io/btc_cov/total.coverage/src/bech32.cpp.gcov.html#542
Maybe update the DecodeDestination fuzz test with error locations?

sidhujag referenced this in commit 67b14190f6 on Dec 7, 2021

RandyMcMillan referenced this in commit f1d0e97733 on Dec 23, 2021

DrahtBot locked this on Dec 7, 2022