A number of follow-ups and improvements to the bech32 error location code, introduced in #16807.
Notably, this removes the hardcoded GF1024 tables in favour of constexpr table generation.
A number of follow-ups and improvements to the bech32 error location code, introduced in #16807.
Notably, this removes the hardcoded GF1024 tables in favour of constexpr table generation.
@ryanofsky and @sipa, this addresses most of your review comments in #16807 so you may like to take a look, thanks!
Notably, this removes the hardcoded GF1024 tables in favour of constexpr table generation.
Wow, this makes the compiler generate the table? Agree this is better for understandability, just wonder, does this have any noticeable impact on compilation time?
@laanwj Wow, this makes the compiler generate the table? Agree this is better for understandability, just wonder, does this have any noticeable impact on compilation time?
It shouldn't do, each element of the table only takes a few bitwise operations on 16 bit integers and the table is only 1024 elements.
EDIT: a quick test compiling that file using clang takes 0.24 on master and 0.28 with this PR so it is pretty negligible overall.
I checked the assembly output from clang 13.0.0 and gcc 9.3.0 that the tables are indeed a) generated constants in .rodata and b) different from before (as expected and mentioned in the commit description) c) match between compilers.
.type _ZN6bech3212_GLOBAL__N_110GF1024_LOGE,@object # @_ZN6bech3212_GLOBAL__N_110GF1024_LOGE
.section .rodata,"a",@progbits
.p2align 1
_ZN6bech3212_GLOBAL__N_110GF1024_LOGE:
.short 65535 # 0xffff
.short 0 # 0x0
.short 99 # 0x63
.short 363 # 0x16b
.short 198 # 0xc6
⋮
.type _ZN6bech3212_GLOBAL__N_110GF1024_EXPE,@object # @_ZN6bech3212_GLOBAL__N_110GF1024_EXPE
.p2align 1
_ZN6bech3212_GLOBAL__N_110GF1024_EXPE:
.short 1 # 0x1
.short 32 # 0x20
.short 311 # 0x137
.short 139 # 0x8b
.short 206 # 0xce
⋮
Full values for both tables can be found here: https://0bin.net/paste/Lp6hKmVD#EwTuc1KFIDzy57g0g3EvtQiLIj3HWD4e+OGPWUJTwQ+ (I did not have anything to check them against).
(I did not have anything to check them against).
Looks like the GF1024_EXP you posted differs from the one that is removed in this pull?
I will review in detail shortly, but I suspect that @meshcollider perhaps included the table changes corresponding to https://github.com/sipa/bech32/pull/64 here? There is Sage code there that generates the tables too, if someone wants something to compare against.
Looks like the GF1024_EXP you posted differs from the one that is removed in this pull?
Both tables are. From the commit:
Note that the tables generated by this code are different to the previous hardcoded tables, because we simplify to encoding (e) as 1 || 0 rather than 9 || 15 (as was done in PR 64 of the bech32 repo).
I began this test with the intent to compare if they were the same, then noticed they were not. But it's expected (they use a different encoding).
Edit: I'll check against @sipa's sage output. Edit2: All values match. The only difference is that the size of GF1024_EXP printed by the Sage script is 1024, while the structure here is 1023 long.
I'm happy to split that commit into two, the first deriving the existing table and then the second modifying to the new table, if that would help review?
The length difference (1024 vs 1023) is just because the sage code also includes (a)^1023 = 1 = (a)^0.
209 | - 452, 710, 552, 128, 612, 600, 275, 322, 193 210 | -}; 211 | -static_assert(std::size(GF1024_LOG) == 1024, "GF1024_EXP length should be 1024"); 212 | +/** We work with the finite field GF(1024) defined as a degree 2 extension of the base field GF(32) 213 | + * The defining polynomial of the extension is x^2 + 9x + 23 214 | + * Let (e) be a primitive element of GF(1024), that is, a generator of the field.
This may be a bit of a philosophical point, touching on difference between specific field and field definition up to isomorphism.
Still, I don't think it's sufficient (or at least confusing) to say it is a primitive element of GF(1024). It is specifically one of the roots of x^2 + 9x + 23. Things work out well because this is a primitive polynomial, and thus its root in the extension field is indeed a generator, but it's perfectly possible to pick a non-primitive but irreducible polynomial to define the extension field, and then pick a primitive element of that field (which then wouldn't be a root of its modulus).
How about this?
Define F=GF(32) as the set of polynomials over GF(2) modulo x^5 + x^3 + 1. We will represent elements of F as their integer encoding (e.g. 11 to mean x^3 + x + 1). Define E=GF(1024) as an extension of F, namely the set of polynomials over F modulo x^2+ 9x + 23. We will represent the element (ax + b) of F as the integer 32*a+b. Let e be the primitive element x of F (represented as 32). Every non-zero element of the field can then ...
This is the definition of a primitive element. From Wikipedia:
In field theory, a primitive element of a finite field GF(q) is a generator of the multiplicative group of the field. In other words, α ∈ GF(q) is called a primitive element if it is a primitive (q − 1)th root of unity in GF(q); this means that each non-zero element of GF(q) can be written as α^i for some integer i.
But I agree this is more than just any primitive element, it is specifically because it is a root of the defining polynomial that we decide to use it. I will amend the comments.
209 | - 575, 992, 463, 983, 243, 360, 970, 350, 267, 615, 766, 494, 31, 1009, 210 | - 452, 710, 552, 128, 612, 600, 275, 322, 193 211 | -}; 212 | -static_assert(std::size(GF1024_LOG) == 1024, "GF1024_EXP length should be 1024"); 213 | +/** We work with the finite field GF(1024) defined as a degree 2 extension of the base field GF(32) 214 | + * The defining polynomial of the extension is x^2 + 9x + 23.
Thanks, though I still have a nit: you haven't defined an encoding for GF(32) elements, so the number 9 and 23 have no meaning.
5 | -invalid characters in the address. For example, this will return the locations of up to two Bech32 6 | -errors. 7 | \ No newline at end of file 8 | +- The `validateaddress` RPC now returns an `error_locations` array for invalid 9 | +addresses, with the indices of invalid character locations in the address (if 10 | +known). For example, this will return the locations of up to two Bech32 errors.
Nit: there is really no way to know these are the actual error locations. They're a guess, except under the assumption that no more than 2 substitution errors have been made, in which case they're guaranteed to be correct.
217 | + * as (e)^k for some power k. 218 | + * The array GF1024_EXP contains all these powers of (e) - GF1024_EXP[k] = (e)^k in GF(1024). 219 | + * Conversely, GF1024_LOG contains the discrete logarithms of these powers, so 220 | + * GF1024_LOG[GF1024_EXP[k]] == k. 221 | + * The following function generates the two tables GF1024_EXP and GF1024_LOG as constexprs. */ 222 | +constexpr std::pair<std::array<int16_t, 1023>, std::array<int16_t, 1024>> generate_gf_tables()
Nit: the generate functions here don't really match our coding style.
ACK 98124a63946d6f7d665864adbef35a863423ca47, with or without nits.
273 | + 274 | + return std::make_pair(GF1024_EXP, GF1024_LOG); 275 | +} 276 | + 277 | +constexpr auto tables = generate_gf_tables(); 278 | +constexpr std::array<int16_t, 1023> GF1024_EXP = tables.first;
I assume this doesn't actually materialize the tables twice, as tables would be an unused symbol in an anonymous namespace. Still, if you'd want to avoid the copying at compile-time (or rule out the risk that it gets emitted twice in the binary), these GF1024_EXP and GF1024_LOG variables could be references (e.g. constexpr const std::array<int16_t, 1023>&).
Updated to address all of @sipa's nits, and split the GF1024 constexpr commit into two:
290 | + return std::make_pair(GF1024_EXP, GF1024_LOG); 291 | +} 292 | + 293 | +constexpr auto tables = GenerateGFTables(); 294 | +constexpr const std::array<int16_t, 1023>& GF1024_EXP = tables.first; 295 | +constexpr const std::array<int16_t, 1024>& GF1024_LOG = tables.second;
After the recent push I cannot find GF1024_EXP and GF1024_LOG as symbols anymore in the compiled output. I think that makes sense as they're references to tables now, which is fine, just need to adapt my comparison script.
Thanks, re-checked:
tables of 1e24b1d385d64892389985afc058c83410c44311 matches the output of the sage scripttables of 83dc0b6392a99ea99e2d67d6893d715e48340023 matches GF1024_EXP + GF1024_LOG of the commit before itCode review ACK 1e24b1d385d64892389985afc058c83410c44311
251 | + SYNDROME_CONSTS[ind] = c; 252 | + } 253 | + } 254 | + return SYNDROME_CONSTS; 255 | +} 256 | +constexpr const std::array<uint32_t, 25>& SYNDROME_CONSTS = GenerateSyndromeConstants();
Using a const reference here is counterproductive, it seems: the result appears to be that the reference is constexpr, but the value it is bound to isn't because it is a temporary, so the actual SYNDROME_CONSTS values aren't constexpr (I noticed when trying to add static_assertions on them).
Ah okay, will revert this change then.
Verified the generated tables by adding the assertions generated by this Sage code:
# Binary field
B.<b> = GF(2)
# Polynomials over the binary field
BP.<bp> = B[]
# Encoder field GF(32)
F.<f> = GF(32, modulus=bp^5 + bp^3 + 1, repr='int')
# Polynomials over the encoder field
FP.<fp> = F[]
# Decoder field GF(1024)
E.<e> = F.extension(modulus=fp^2 + F.fetch_int(9)*fp + F.fetch_int(23))
# Convert an E field element to an integer 0..1023.
def gf_to_int(v):
return v[0].integer_representation() + 32*v[1].integer_representation()
# Convert an integer 0..1023 to an E field element.
def int_to_gf(v):
return F.fetch_int(v & 31) + e*F.fetch_int(v >> 5)
GF1024_EXP = [gf_to_int(e**i) for i in range(1023)]
GF1024_LOG = [-1] + [discrete_log(int_to_gf(i), e, ord=1023, operation="*") for i in range(1, 1024)]
for i in range(1023):
print("static_assert(GF1024_EXP[%i] == %i);" % (i, GF1024_EXP[i]))
for i in range(1024):
print("static_assert(GF1024_LOG[%i] == %i);" % (i, GF1024_LOG[i]))
SYNDROME_CONSTS = [None for _ in range(25)]
a1, a2, a3 = e**997, e**998, e**999
for k in range(1, 6):
for b in range(5):
c = (gf_to_int(a1**k * int_to_gf(2)**b) |
gf_to_int(a2**k * int_to_gf(2)**b) << 10 |
gf_to_int(a3**k * int_to_gf(2)**b) << 20)
SYNDROME_CONSTS[(k-1)*5 + b] = c
for i in range(25):
print("static_assert(SYNDROME_CONSTS[%i] == %i);" % (i, SYNDROME_CONSTS[i]))
This follows PR 64 of the sipa/bech32 repo.
Only change:
- constexpr const std::array<uint32_t, 25>& SYNDROME_CONSTS = GenerateSyndromeConstants();
+ constexpr std::array<uint32_t, 25> SYNDROME_CONSTS = GenerateSyndromeConstants();
as per @sipa's comment.
ACK 2fa4fd196176160a5ad0a25da173ff93252b8103, verified the tables by inserting assertions generated by a Sage script.
397 | @@ -404,7 +398,8 @@ DecodeResult Decode(const std::string& str) {
398 | /** Find index of an incorrect character in a Bech32 string. */
399 | std::string LocateErrors(const std::string& str, std::vector<int>& error_locations) {
400 | if (str.size() > 90) {
In commit "Use std::iota instead of manually pushing range" (2fa4fd196176160a5ad0a25da173ff93252b8103)
It would be good to add error_locations.clear() to the top of this function, if the intent is now to overwrite error_locations instead of appending to it. Before this commit, the function consistently appended without overwriting, and now it sometimes overwrites, sometimes appends, which is less consistent.
Done in new commit to avoid invalidating the two ACKs.
Sorry to nit the nit, if the intent is to always return something new instead of modifying in place, why not return both values in the return value with e.g. std::pair or std::tuple instead? Personally at least I really prefer using the return value to return values instead of output arguments with references, it conveys the intent better.
@laanwj no worries, I agree. Addressed your nit-nit :)
Code review ACK 2fa4fd196176160a5ad0a25da173ff93252b8103 with caveat that I didn't try to understand or verify the math here, I just checked that all comments describing the math matched up with the code, and that all the changes seemed like improvements.
Re-ACK a4fe70171b6fa570eda71d86b59d0fb24c2f0614
395 | @@ -396,23 +396,28 @@ DecodeResult Decode(const std::string& str) { 396 | } 397 | 398 | /** Find index of an incorrect character in a Bech32 string. */ 399 | -std::string LocateErrors(const std::string& str, std::vector<int>& error_locations) { 400 | +std::pair<std::string, std::vector<int>> LocateErrors(const std::string& str) {
The docstring reads odd in doxygen:
"Find index of an incorrect character in a Bech32 string. Return the positions of errors in a Bech32 string. "
See https://doxygen.bitcoincore.org/namespacebech32.html#ad3d7336317a6e6d359b54b2f60b53acc
Also, std::make_pair is not needed with C++11/17
diff --git a/src/bech32.cpp b/src/bech32.cpp
index 3cda1dfff5..abc617467d 100644
--- a/src/bech32.cpp
+++ b/src/bech32.cpp
@@ -111,7 +111,7 @@ constexpr std::pair<std::array<int16_t, 1023>, std::array<int16_t, 1024>> Genera
GF1024_LOG[v] = i;
}
- return std::make_pair(GF1024_EXP, GF1024_LOG);
+ return {GF1024_EXP, GF1024_LOG};
}
constexpr auto tables = GenerateGFTables();
@@ -395,27 +395,27 @@ DecodeResult Decode(const std::string& str) {
return {result, std::move(hrp), data(values.begin(), values.end() - 6)};
}
-/** Find index of an incorrect character in a Bech32 string. */
-std::pair<std::string, std::vector<int>> LocateErrors(const std::string& str) {
+std::pair<std::string, std::vector<int>> LocateErrors(const std::string& str)
+{
std::vector<int> error_locations{};
if (str.size() > 90) {
error_locations.resize(str.size() - 90);
std::iota(error_locations.begin(), error_locations.end(), 90);
- return std::make_pair("Bech32 string too long", std::move(error_locations));
+ return {"Bech32 string too long", std::move(error_locations)};
}
if (!CheckCharacters(str, error_locations)){
- return std::make_pair("Invalid character or mixed case", std::move(error_locations));
+ return {"Invalid character or mixed case", std::move(error_locations)};
}
size_t pos = str.rfind('1');
if (pos == str.npos) {
- return std::make_pair("Missing separator", std::vector<int>{});
+ return {"Missing separator", std::vector<int>{}};
}
if (pos == 0 || pos + 7 > str.size()) {
error_locations.push_back(pos);
- return std::make_pair("Invalid separator position", std::move(error_locations));
+ return {"Invalid separator position", std::move(error_locations)};
}
std::string hrp;
@@ -430,7 +430,7 @@ std::pair<std::string, std::vector<int>> LocateErrors(const std::string& str) {
int8_t rev = CHARSET_REV[c];
if (rev == -1) {
error_locations.push_back(i);
- return std::make_pair("Invalid Base 32 character", std::move(error_locations));
+ return {"Invalid Base 32 character", std::move(error_locations)};
}
values[i - pos - 1] = rev;
}
@@ -550,7 +550,7 @@ std::pair<std::string, std::vector<int>> LocateErrors(const std::string& str) {
}
} else {
// No errors
- return std::make_pair("", std::vector<int>{});
+ return {"", std::vector<int>{}};
}
if (error_locations.empty() || (!possible_errors.empty() && possible_errors.size() < error_locations.size())) {
@@ -562,7 +562,7 @@ std::pair<std::string, std::vector<int>> LocateErrors(const std::string& str) {
: error_encoding == Encoding::BECH32 ? "Invalid Bech32 checksum"
: "Invalid checksum";
- return std::make_pair(error_message, std::move(error_locations));
+ return {error_message, std::move(error_locations)};
}
} // namespace bech32
diff --git a/src/bech32.h b/src/bech32.h
index 5e89e6efda..707299f19e 100644
--- a/src/bech32.h
+++ b/src/bech32.h
@@ -45,7 +45,7 @@ struct DecodeResult
/** Decode a Bech32 or Bech32m string. */
DecodeResult Decode(const std::string& str);
-/** Return the positions of errors in a Bech32 string. */
+/** Return the positions of errors in a Bech32 string. (Empty string if no errors can be found). */
std::pair<std::string, std::vector<int>> LocateErrors(const std::string& str);
} // namespace bech32
left a nit
More nits:
DecodeDestination fuzz test with error locations?
Milestone
23.0