Synthetic int128 type. #1000

pull roconnor-blockstream wants to merge 3 commits into bitcoin-core:master from roconnor-blockstream:20211026_int128 changing 18 files +814 −297
  1. roconnor-blockstream commented at 2:41 pm on October 26, 2021: contributor

    Abstracts the int128 type and provides a native version, if available, or implements it using a pair of 64-bit integers.

    This is activated by setting the configuration flag --with-test-override-wide-multiply=int128_struct.

    The primary purpose of this PR is to take advantage of MSVC’s umulh intrinsic to simulate an int128 type, which MSVC does not have (AFAIU). This PR lays out the groundwork for this level of MSVC support, but doesn’t include the configuration logic to enable it yet.

    For completeness, implementations of umulh and mulh are also provided for compilers that support neither the intrinsic nor the int128 type (such as CompCert?). This also opens up the possibility of removing the 32-bit field and scalar implementations, should that ever be desired.
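    For reference, the high half of an unsigned 64×64→128-bit product (the effect of `__umulh`) can be computed portably by schoolbook multiplication on 32-bit halves. This is an illustrative sketch under that general technique, not necessarily the PR's exact code; `umulh_sketch` is a hypothetical name:

    ```c
    #include <stdint.h>

    /* Sketch: high 64 bits of the unsigned 128-bit product a*b,
     * using only 64-bit arithmetic on 32-bit halves. */
    static uint64_t umulh_sketch(uint64_t a, uint64_t b) {
        uint64_t ll = (a & 0xffffffff) * (b & 0xffffffff);
        uint64_t lh = (a & 0xffffffff) * (b >> 32);
        uint64_t hl = (a >> 32) * (b & 0xffffffff);
        uint64_t hh = (a >> 32) * (b >> 32);
        /* Middle sum: at most 0xfffffffffffffffe, so it cannot wrap. */
        uint64_t mid = (ll >> 32) + (lh & 0xffffffff) + hl;
        return hh + (lh >> 32) + (mid >> 32);
    }
    ```

    The middle sum is bounded by (2^32-1) + (2^32-1) + (2^32-1)^2 < 2^64, so no carry is lost.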

  2. in src/int128.h:25 in 4c684ff8ff outdated
    20+typedef secp256k1_uint128 secp256k1_int128;
    21+#endif
    22+
    23+/* Low 32 bits of a (u)int64_t as an uint64_t. */
    24+#define LO32(x) ((uint64_t)(x) & 0xffffffff)
    25+#define HI32(x) ((x) >> 32)
    


    peterdettman commented at 10:20 am on October 27, 2021:
    Why the uint64_t cast for LO32 but not for HI32?

    roconnor-blockstream commented at 2:59 pm on October 27, 2021:
    I’m planning to remove these macros soon, but the idea was that I wanted to specifically use signed/unsigned shifts (depending on the type), while otherwise I generally used unsigned for bit-twiddling.
  3. in src/modinv64_impl.h:354 in 4c684ff8ff outdated
    350+    VERIFY_CHECK((secp256k1_i128_to_i64(&ce) & M62) == 0); secp256k1_i128_rshift(&ce, 62);
    351     /* Compute limb 1 of t*[d,e]+modulus*[md,me], and store it as output limb 0 (= down shift). */
    352-    cd += (int128_t)u * d1 + (int128_t)v * e1;
    353-    ce += (int128_t)q * d1 + (int128_t)r * e1;
    354+    secp256k1_i128_accum_mul(&cd, u, d1);
    355+    secp256k1_i128_accum_mul(&cd, v, e1);
    


    peterdettman commented at 10:29 am on October 27, 2021:

    I just want to note that accumulating to cd twice requires different bounds analysis (which I haven’t done) than summing then accumulating, although perhaps the C compiler was always free to rearrange it that way anyway?

    Same occurs repeatedly below, obviously.


    roconnor-blockstream commented at 3:00 pm on October 27, 2021:

    Oh, you are saying that by re-associating the sums here, I may be inducing an int128 overflow?

    I’ll have to look into that.

    (Thanks, this is exactly the sort of feedback I was looking for in this early draft.)


    peterdettman commented at 3:23 pm on October 27, 2021:
    I’m only saying that I didn’t analyse it in its current form. I doubt there is an issue, but it should be checked. Though as I say, perhaps the C compiler was always free to rearrange it.

    roconnor-blockstream commented at 3:46 pm on October 27, 2021:
    I really hope the C compiler isn’t allowed to keep trying to reassociate my operations in order to find some arrangement that invokes UB and thus allow itself to emit code to format my hard drive.

    sipa commented at 4:03 pm on October 27, 2021:

    Of course not.

    UB is defined in terms of the source code, not the compilation result.

    Rearranging is something the compiler does long after that point, following the “as if” rule.


    real-or-random commented at 11:13 am on February 7, 2022:

    Could we just introduce a temporary variable and sum up there first? Or better, add a secp256k1_i128_accum_mul_twice that will accumulate only in the int128 impl? This will save reviewer time here. I expect the compiler to prune it anyway.

    (Or has anyone redone the bound analysis? I guess it’s not hard but I’d need to refresh my memory here.)


    real-or-random commented at 1:18 pm on July 11, 2022:
    (resolving, see later comments in the main thread)
  4. roconnor-blockstream force-pushed on Nov 1, 2021
  5. in src/int128.h:14 in 3bf06c364e outdated
     9+#undef UINT128_MAX
    10+*/
    11+
    12+#if defined(UINT128_MAX)
    13+typedef uint128_t secp256k1_uint128;
    14+typedef int128_t secp256k1_int128;
    


    real-or-random commented at 12:49 pm on November 3, 2021:
    I think it’s a good idea to define our own type with a separate prefix. I guess the entire logic currently in util.h will then conceptually belong here.

    roconnor-blockstream commented at 1:22 pm on November 3, 2021:
    I am inclined to move the int128 logic from util.h into this file. I just haven’t done it yet.

    real-or-random commented at 11:45 am on December 11, 2021:
    This hasn’t been resolved yet.

    roconnor-blockstream commented at 3:15 pm on December 11, 2021:
    Are you not happy with the changes to util.h?

    real-or-random commented at 4:59 pm on December 11, 2021:

    I’d prefer to move the remaining #ifdef WIDEMUL logic in util.h to int128_impl.h (or int128.h?) because it belongs there conceptually.

    I think util.h is just the kitchen sink for code pieces that don’t have a proper home (or where we were too lazy to create a proper home). Now that you introduce int128_impl.h here, I think that’s the place where those #ifdefs belong.


    roconnor-blockstream commented at 5:28 pm on December 11, 2021:
    As it stands, when SECP256K1_WIDEMUL_INT64 ends up being defined, int128_impl.h isn’t even included anywhere. So I’m not sure what you are imagining.

    real-or-random commented at 1:50 pm on February 1, 2022:
    Yes, I was wrong. See below though.
  6. real-or-random commented at 12:58 pm on November 3, 2021: contributor
    I think it would help readability to arrange the different implementations of the type in different files, similar to what we have for the field and scalar “modules”, where we have implementation files such as scalar_4x64_impl.h but still a single scalar.h that makes sure that all function prototypes are identical. Then you could remove a lot of the ifdefs within the functions.
  7. roconnor-blockstream force-pushed on Nov 8, 2021
  8. roconnor-blockstream force-pushed on Nov 15, 2021
  9. roconnor-blockstream force-pushed on Nov 15, 2021
  10. roconnor-blockstream force-pushed on Dec 7, 2021
  11. roconnor-blockstream force-pushed on Dec 7, 2021
  12. roconnor-blockstream renamed this:
    WIP: Synthetic int128 type.
    Synthetic int128 type.
    on Dec 7, 2021
  13. roconnor-blockstream marked this as ready for review on Dec 7, 2021
  14. roconnor-blockstream commented at 8:42 pm on December 7, 2021: contributor

    I’m moving this out of draft stage. The coding is complete. There are a few tasks that remain.

    • Double check that there is no signed integer overflow due to rearrangement of the order of some operations.
    • We need someone to build this on MSVC.
    • Make sure that the refactored code for the native int128 type isn’t any slower.

    To compile with the synthetic int128 type, you currently need to pass a configuration flag --with-test-override-wide-multiply=int128_struct.

  15. roconnor-blockstream force-pushed on Dec 7, 2021
  16. real-or-random commented at 7:13 pm on December 9, 2021: contributor
    * [ ]  We need someone to build this on MSVC.
    

    I have built this with MSVC 2019 on wine (https://github.com/mstorsjo/msvc-wine) and it works. Tests and exhaustive tests pass.

    For reference:

    ./configure --disable-benchmark --with-test-override-wide-multiply=int128_struct CC=/opt/msvc/bin/x64/cl.exe CFLAGS="-Za -O2 -w" LD=/opt/msvc/bin/x64/link.exe
    

    For the Arch Linux users, there’s an AUR package: https://aur.archlinux.org/packages/msvc-wine-git/

  17. sipa commented at 7:14 pm on December 9, 2021: contributor
    FWIW, bitcoin core has an AppVeyor MSVC/Windows CI environment. Maybe it’s worth looking into adding one for libsecp256k1 directly.
  18. real-or-random commented at 11:06 am on December 10, 2021: contributor
    Indeed. Let me note that Cirrus also apparently offers Windows, though I have no idea if it’s good and/or reliable: https://cirrus-ci.org/guide/windows/
  19. in src/int128_struct_impl.h:24 in c8d49e2710 outdated
     5+
     6+#if defined(_M_X64) | defined(_M_ARM64) | defined(_WIN64) /* MSVC */
     7+  #include <intrin.h>
     8+  #define secp256k1_umulh __umulh
     9+  #define secp256k1_mulh __mulh
    10+#else
    


    real-or-random commented at 12:25 pm on December 10, 2021:
    #if defined(_M_X64) | defined(_M_ARM64) | defined(_WIN64) /* MSVC */
      #error
      #include <intrin.h>
      #define secp256k1_umulh __umulh
      #define secp256k1_mulh __mulh
    #else
    

    Follow-up: I confirmed that this change results in an error in the MSVC build, so the intrinsics should indeed be used.

  20. in src/int128.h:20 in c8d49e2710 outdated
    15+#error "Please select int128 implementation"
    16+#endif
    17+
    18+static SECP256K1_INLINE uint64_t secp256k1_umulh(uint64_t a, uint64_t b);
    19+
    20+static SECP256K1_INLINE int64_t secp256k1_mulh(int64_t a, int64_t b);
    


    sipa commented at 7:13 pm on December 10, 2021:
    This function does not appear to be used outside of src/int128_struct_impl.h, perhaps it can be considered a private implementation detail there, rather than exposing it?

    roconnor-blockstream commented at 10:10 pm on January 31, 2022:
    Done.
  21. in src/int128.h:43 in c8d49e2710 outdated
    38+
    39+static SECP256K1_INLINE void secp256k1_i128_mul(secp256k1_int128 *r, int64_t a, int64_t b);
    40+
    41+static SECP256K1_INLINE void secp256k1_i128_accum_mul(secp256k1_int128 *r, int64_t a, int64_t b);
    42+
    43+static SECP256K1_INLINE void secp256k1_i128_dissip_mul(secp256k1_int128 *r, int64_t a, int64_t b);
    


    sipa commented at 7:14 pm on December 10, 2021:
    This function does not appear to be used outside of src/int128_struct_impl.h, perhaps it can be considered a private implementation detail there, rather than exposing it?

    roconnor-blockstream commented at 10:10 pm on January 31, 2022:
    Done.
  22. in src/scalar_4x64_impl.h:213 in c8d49e2710 outdated
    222-    { \
    223-        uint128_t t = (uint128_t)a * b; \
    224-        th = t >> 64;         /* at most 0xFFFFFFFFFFFFFFFE */ \
    225-        tl = t; \
    226-    } \
    227+    uint64_t th = secp256k1_umulh(a, b);  /* at most 0xFFFFFFFFFFFFFFFE */ \
    


    sipa commented at 7:20 pm on December 10, 2021:

    Really the th and tl computed here (and below) are the top and bottom of a secp256k1_uint128. Perhaps it should be written in terms of the actual secp256k1_u128_... functions, instead of the lower-level secp256k1_umulh? This would have as advantages:

    • Not relying on the compiler to recognize that a * b is computed twice and can be deduplicated (in native mode).
    • Permit removing secp256k1_umulh from the header.

    roconnor-blockstream commented at 10:10 pm on January 31, 2022:
    Done.
  23. roconnor-blockstream force-pushed on Jan 31, 2022
  24. real-or-random commented at 1:54 pm on February 1, 2022: contributor

    My above comment about the inclusion stuff was wrong but here’s a cleaner version: https://github.com/real-or-random/secp256k1/commits/202201-int128-includes

    The first commit simply fixes naming of header guards and should belong to this PR.

    The second commit changes the code to adhere to what I wrote in #1039:

    After this commit, int128.h and int128_impl.h are included as follows:

    • .c files which use int128 include int128_impl.h (after util.h)
    • .h files which use int128 include int128.h (after util.h)

    This list is exhaustive. util.h needs to be included first because it sets up the necessary #defines.

    If you want, please pick the second commit, too. Or if you don’t want to deal with the C mess, I can create a PR with the second commit on top of yours, and we fix this once your PR has been merged.

  25. roconnor-blockstream commented at 10:35 pm on February 14, 2022: contributor

    I notice that your PR still keeps the USE_*_WIDEMUL_* in util.h. It still seems we cannot really move it out of util.h because it both selects between INT128_STRUCT and INT128_NATIVE, and can also select WIDEMUL_INT64, which isn’t really INT128 related.

    More specifically, this block of CPP code interprets the wide-multiply configuration option, which I’ve designed to have three options (by adding that third option).

  26. roconnor-blockstream force-pushed on Feb 14, 2022
  27. real-or-random commented at 1:36 pm on February 17, 2022: contributor

    I notice that your PR still keeps the USE_*_WIDEMUL_* in util.h. It still seems we cannot really move it out of util.h because it both selects between INT128_STRUCT and INT128_NATIVE, and can also select WIDEMUL_INT64, which isn’t really INT128 related.

    Right. I think that’s fine.

    So this means we have a couple of input #defines (preset in the compiler or set by autoconf, e.g., USE_*_WIDEMUL_*) and we have some CPP logic in that file that turns those inputs into “nice” #ifdefs that can then be used in the rest of the code base, e.g., INT128_NATIVE. I believe that’s a reasonable way of doing things.

    (In the future we should move the CPP logic from util.h to a separate file. It somehow ended up in util.h though it does not belong there…)

  28. roconnor-blockstream commented at 6:58 pm on February 28, 2022: contributor

    I’ve reviewed the changes in the association of + for signed int128 calculations to look for changes in overflow behavior.

    Signed int128 is only used in the src/modinv64_impl.h file. The only changes in association occur in the functions:

    • secp256k1_modinv64_update_de_62
    • secp256k1_modinv64_update_fg_62
    • secp256k1_modinv64_update_fg_62_var

    The values being summed are all of the form (int128_t)u * ??? + (int128_t)v * ??? (or similarly for q and r). The only way the new association could cause an overflow (or underflow) is if the previous calculation depended on some cancellation within the above addition to prevent the final accumulation from itself overflowing. However, the previous calculation does not depend on any cancellation. The only assumed constraint on u and v (resp. q and r) is that |u|+|v| <= 2^62, which implies nothing about the values of u and v causing cancellation.

    My conclusion, after the above and re-reviewing the above functions, is that neither the new nor the old order of addition risks overflow.

  29. real-or-random commented at 10:22 am on March 1, 2022: contributor

    Indeed, we know nothing about the signs of the summands. To convince myself, I also redid the bounds analysis:

    Counting value bits (i.e., bits that are not the sign bit):

    For each step, cd starts with at most 52 bits. Then we accumulate (int128_t)u * ???, where the factors have at most 62 bits (implied by |u|+|v| <= 2^62*) and 63 bits (trivial bound for int64_t). Same with (int128_t)v * e0. Then we accumulate (int128_t) modinfo->modulus.??? * md, where the factors have at most 62 (modulus is in signed62) and 63 value bits.

    We work with signed, so the value with the maximum absolute value representable in B value bits is -(1<<B) (with absolute value 1<<B).
    This means the maximum absolute value cd can take is (1<<52) + (1<<62) * (1<<63) + (1<<62) * (1<<63) + (1<<62) * (1<<63), which is a 127-bit value and thus representable in 127 value bits.

    Same is true for ce.

    *except u==2^62, which needs 63 bits, but then v==0, which also works: (1<<52) + (1<<63) * (1<<63) + 0 * (1<<63) + (1<<62) * (1<<63) is also a 127-bit value.
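    The arithmetic in this analysis can be rechecked mechanically. A sketch doing so with exact 128-bit arithmetic (it assumes a compiler with `__int128`, which is precisely what the struct implementation avoids relying on, so this is purely illustrative; the function name is hypothetical):

    ```c
    #include <assert.h>

    /* Recheck the bounds from the analysis above exactly. */
    static void check_modinv64_bounds(void) {
    #ifdef __SIZEOF_INT128__
        unsigned __int128 one = 1;
        /* generic case: 52-bit carry plus three 62x63-bit products */
        unsigned __int128 bound = (one << 52) + 3 * ((one << 62) * (one << 63));
        /* edge case u == 2^62 (63 value bits), which forces v == 0 */
        unsigned __int128 edge = (one << 52) + (one << 63) * (one << 63);
        assert(bound < (one << 127)); /* fits in 127 value bits */
        assert(edge < (one << 127));
    #endif
    }
    ```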

  30. in src/int128.h:51 in b0d2fe0d9e outdated
    46+
    47+static SECP256K1_INLINE void secp256k1_i128_from_i64(secp256k1_int128 *r, int64_t a);
    48+
    49+static SECP256K1_INLINE int secp256k1_i128_eq(const secp256k1_int128 *a, const secp256k1_int128 *b);
    50+
    51+static SECP256K1_INLINE int secp256k1_i128_check_bit(secp256k1_int128 *r, unsigned int n);
    


    real-or-random commented at 10:43 am on March 1, 2022:
    static SECP256K1_INLINE int secp256k1_i128_check_bit(const secp256k1_int128 *a, unsigned int n);
    

    same for u128 version


    roconnor-blockstream commented at 9:17 pm on March 7, 2022:
    done.
  31. in src/int128_struct_impl.h:6 in b0d2fe0d9e outdated
    0@@ -0,0 +1,133 @@
    1+#ifndef SECP256K1_INT128_STRUCT_IMPL_H
    2+#define SECP256K1_INT128_STRUCT_IMPL_H
    3+
    4+#include "int128.h"
    5+
    6+#if defined(_M_X64) | defined(_M_ARM64) | defined(_WIN64) /* MSVC */
    


    real-or-random commented at 11:12 am on March 1, 2022:

    Hm, I’m not sure about _WIN64. This is defined by icc for example, and as I understand also by mingw. (The Internet says that mingw, which is basically “gcc for windows”, will support the MSVC intrinsics, but it will also have a native 128-bit type, so we’ll probably want to use the latter.)

    MSVC is what provides those intrinsics and _MSC_VER should really only be defined on MSVC. What about one of these two?

    #if defined(_MSC_VER) && defined(_WIN64)
    

    or

    #if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64))
    

    Currently, the two versions should be equivalent. This may change if MSVC starts to support another 64-bit target in the future, and then I believe it is likely that __mulh and __umulh will exist, so I lean toward the first option. But anyway, when that change happens to MSVC, we can deal with it…

    Parts of my comment are guesswork so far, I believe we’ll need to test compilation.


    roconnor-blockstream commented at 7:02 pm on March 7, 2022:
    I’m inclined to go with #if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64))

    roconnor-blockstream commented at 9:16 pm on March 7, 2022:
    done.
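    The resolution above amounts to a dispatch like the following sketch (the surrounding fallback is elided; macro names follow the diff shown earlier, but this is illustrative, not the PR's final code):

    ```c
    /* Gate the 64-bit MSVC intrinsics on _MSC_VER plus a 64-bit architecture
     * macro, rather than _WIN64 alone (which mingw and icc also define). */
    #if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64))
    #  include <intrin.h>
    #  define secp256k1_umulh __umulh
    #  define secp256k1_mulh __mulh
    #else
    /* ... portable 64-bit fallback implementations go here ... */
    #endif
    ```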
  32. real-or-random commented at 11:17 am on March 1, 2022: contributor
    @roconnor-blockstream Can you rebase this? This will ease benchmarking against master.
  33. real-or-random commented at 11:37 am on March 1, 2022: contributor

    Native 128-bit performance looks good:

    $ SECP256K1_BENCH_ITERS=1000000 ./bench_internal inverse

     gcc 11.2, pr:

     Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)

     scalar_inverse                ,     2.40      ,     2.41      ,     2.42
     scalar_inverse_var            ,     1.74      ,     1.75      ,     1.81
     field_inverse                 ,     2.38      ,     2.39      ,     2.40
     field_inverse_var             ,     1.73      ,     1.73      ,     1.74

     gcc 11.2, master:

     Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)

     scalar_inverse                ,     2.39      ,     2.39      ,     2.41
     scalar_inverse_var            ,     1.77      ,     1.77      ,     1.78
     field_inverse                 ,     2.37      ,     2.38      ,     2.39
     field_inverse_var             ,     1.76      ,     1.77      ,     1.82

     clang 13.0.1, pr:

     Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)

     scalar_inverse                ,     2.70      ,     2.70      ,     2.72
     scalar_inverse_var            ,     1.69      ,     1.70      ,     1.70
     field_inverse                 ,     2.69      ,     2.70      ,     2.70
     field_inverse_var             ,     1.69      ,     1.69      ,     1.70

     clang 13.0.1, master:

     Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)

     scalar_inverse                ,     2.69      ,     2.70      ,     2.71
     scalar_inverse_var            ,     1.69      ,     1.69      ,     1.70
     field_inverse                 ,     2.69      ,     2.69      ,     2.69
     field_inverse_var             ,     1.68      ,     1.68      ,     1.68
    

    But this is with asm on, I should have turned it off… Would be nice to see more benchmarks.

  34. in src/int128_struct_impl.h:168 in b0d2fe0d9e outdated
    120+   r->hi = (uint64_t)(a >> 63);
    121+   r->lo = (uint64_t)a;
    122+}
    123+
    124+static SECP256K1_INLINE int secp256k1_i128_eq(const secp256k1_int128 *a, const secp256k1_int128 *b) {
    125+   return a->hi == b->hi && a->lo == b->lo;
    


    real-or-random commented at 2:19 pm on March 1, 2022:

    I think we should name this secp256k1_i128_eq_var. Or try to make it const-time. A standard way would be:

    static SECP256K1_INLINE int secp256k1_i128_eq(const secp256k1_int128 *a, const secp256k1_int128 *b) {
       return !((a->hi ^ b->hi) | (a->lo ^ b->lo));
    }
    

    Not sure if this works though.


    roconnor-blockstream commented at 4:55 pm on March 7, 2022:
    I’m going to change it to eq_var, since it is far from obvious to me that the native version is constant time.

    roconnor-blockstream commented at 9:16 pm on March 7, 2022:
    done.
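    For illustration, the branch-free pattern suggested above does work for the two-limb representation: XOR equal limbs to zero, OR the differences, and test the combined word once. A sketch with hypothetical names:

    ```c
    #include <stdint.h>

    /* Branch-free 128-bit equality on (hi,lo) limb pairs: the OR of the
     * per-limb XOR differences is zero iff all limbs are equal. */
    static int eq128_ct_sketch(uint64_t ahi, uint64_t alo,
                               uint64_t bhi, uint64_t blo) {
        return ((ahi ^ bhi) | (alo ^ blo)) == 0;
    }
    ```

    Whether the final `== 0` compiles to branch-free code is still compiler-dependent, which is one reason the thread settled on the `_var` name instead.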
  35. in src/int128_struct_impl.h:173 in b0d2fe0d9e outdated
    124+static SECP256K1_INLINE int secp256k1_i128_eq(const secp256k1_int128 *a, const secp256k1_int128 *b) {
    125+   return a->hi == b->hi && a->lo == b->lo;
    126+}
    127+
    128+static SECP256K1_INLINE int secp256k1_i128_check_bit(secp256k1_int128 *r, unsigned int n) {
    129+   return n >= 64 ? r->hi == (uint64_t)1 << (n - 64) && r->lo == 0
    


    real-or-random commented at 2:22 pm on March 1, 2022:
       VERIFY_CHECK(n < 128);
       return n >= 64 ? r->hi == (uint64_t)1 << (n - 64) && r->lo == 0
    

    same for the unsigned version


    roconnor-blockstream commented at 9:16 pm on March 7, 2022:
    done.
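    A complete sketch of the struct-based check with the suggested range check, written against plain (hi,lo) arguments and a standard assert; names are hypothetical:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Sketch: does the unsigned 128-bit value (hi,lo) equal 2^n? 0 <= n < 128. */
    static int u128_check_pow2_sketch(uint64_t hi, uint64_t lo, unsigned int n) {
        assert(n < 128);
        return n >= 64 ? hi == (uint64_t)1 << (n - 64) && lo == 0
                       : hi == 0 && lo == (uint64_t)1 << n;
    }
    ```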
  36. in src/int128_struct_impl.h:105 in b0d2fe0d9e outdated
    100+}
    101+
    102+/* Signed (arithmetic) right shift.
    103+ * Non-constant time in b.
    104+ */
    105+static SECP256K1_INLINE void secp256k1_i128_rshift(secp256k1_int128 *r, unsigned int b) {
    


    real-or-random commented at 2:23 pm on March 1, 2022:
    static SECP256K1_INLINE void secp256k1_i128_rshift(secp256k1_int128 *r, unsigned int b) {
        VERIFY_CHECK(b < 128);
    

    same for the unsigned version


    roconnor-blockstream commented at 9:16 pm on March 7, 2022:
    done.
  37. real-or-random commented at 2:36 pm on March 1, 2022: contributor
    I haven’t reviewed the struct implementations of the secp256k1_?128_accum/mul/mulh/rshift algorithms in detail. I think it will be good to have some randomized unit tests here. (Maybe it’s ok to just run them when we have a native type too and compare the results?) Most of these functions should be exercised by the current tests already, but some may not be. For example, the shift functions are only called with specific shifts, so I think some branches will never be taken.
  38. in src/util.h:234 in b0d2fe0d9e outdated
    249@@ -250,28 +250,23 @@ static SECP256K1_INLINE void secp256k1_int_cmov(int *r, const int *a, int flag)
    250     *r = (int)(r_masked | a_masked);
    251 }
    252 
    253-/* If USE_FORCE_WIDEMUL_{INT128,INT64} is set, use that wide multiplication implementation.
    254+/* If USE_FORCE_WIDEMUL_{INT128, INT128_STRUCT, INT64} is set, use that wide multiplication implementation.
    255  * Otherwise use the presence of __SIZEOF_INT128__ to decide.
    


    real-or-random commented at 3:41 pm on March 1, 2022:
    I think we should detect 64-bit MSVC here and select SECP256K1_INT128_STRUCT.

    roconnor-blockstream commented at 4:34 pm on March 1, 2022:
    I’d prefer that done in a separate PR.

    real-or-random commented at 9:23 pm on March 1, 2022:

    Ok sure, that will make things easier. :+1:

    edit: If the struct implementation is anyway only enabled with the right configure flag, we can also postpone the #if defined(_M_X64) | defined(_M_ARM64) | defined(_WIN64) discussion and figure the best macros out later.


    real-or-random commented at 11:39 am on March 10, 2022:

    I think we should detect 64-bit MSVC here and select SECP256K1_INT128_STRUCT.

    Note for the future PR: What we should actually do is use the word size to decide which implementation to use, independently of the exact compiler.

  39. roconnor-blockstream force-pushed on Mar 1, 2022
  40. roconnor-blockstream commented at 4:59 pm on March 1, 2022: contributor
    Rebased.
  41. roconnor-blockstream force-pushed on Mar 7, 2022
  42. in src/int128_struct_impl.h:52 in 101cd3e290 outdated
    47+   VERIFY_CHECK(n < 128);
    48+   if (n >= 64) {
    49+     r->lo = (r->hi) >> (n-64);
    50+     r->hi = 0;
    51+   } else if (n > 0) {
    52+     r->lo = ((r->hi & (((uint64_t)1<<n)-1)) << (64-n)) | r->lo >> n;
    


    roconnor-blockstream commented at 11:23 pm on March 14, 2022:
    @real-or-random Unsigned overflow from shifting is well defined, right? So I don’t need the mask here?

    real-or-random commented at 9:44 am on March 15, 2022:

    Right, unsigned overflow is never UB, even when it comes from shifting. See C99: https://port70.net/~nsz/c/c99/n1256.html#6.5.7p4

    (The corresponding C89 part is a little bit strange since it doesn’t mention the UB explicitly and it works under the assumption that int and long are the only non-promoted integer types: https://port70.net/~nsz/c/c89/c89-draft.html#3.3.7 I mean ok, stdint.h does not exist in C89…)


    roconnor-blockstream commented at 12:16 pm on March 15, 2022:
    Oh, I remember why I always write it this way: because we technically don’t know the size of signed int.

    laanwj commented at 11:17 pm on May 27, 2022:
    yes, unsigned overflow is always defined, it’s wrap-around no matter what

    roconnor-blockstream commented at 0:07 am on May 28, 2022:

    Right. But technically speaking a signed int could, in some unrealistic universe, be 65 bits, thus subjecting r->hi to signed integer promotion and thus causing signed integer overflow.

    That said, if people would like me to rewrite this to assume 32-bit signed ints, I’m happy to change it.


    real-or-random commented at 1:07 pm on July 11, 2022:

    That said, if people would like me to rewrite this to assume 32-bit signed ints, I’m happy to change it.

    Well I think we assume at a lot of places in the code that ints are at most 32 bits (see also “No integer promotion for uint32_t” in assumptions.h). I think if you took care of this elsewhere in the entire file, it’s meaningful to keep that “protection” (but I don’t have a strong opinion).

    Though I think something like this is the idiomatic way of protecting against unsigned to signed promotion:

         r->lo = ((0U + r->hi) << (64-n)) | r->lo >> n;
    

    roconnor-blockstream commented at 7:10 pm on July 11, 2022:
    Done something similar to the suggestion.
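    Putting the idiom together, a sketch of the logical (unsigned) right shift for the 0 < n < 64 branch, with hypothetical names (the PR's function operates on its struct type instead):

    ```c
    #include <stdint.h>

    /* Sketch: logical right shift of the 128-bit value (hi,lo) by n, 0 < n < 64.
     * The 0U + *hi guards against a hypothetical implementation whose int is
     * wider than 64 bits, where *hi would promote to signed and the left
     * shift could be signed overflow. */
    static void u128_rshift_sketch(uint64_t *hi, uint64_t *lo, unsigned int n) {
        *lo = ((0U + *hi) << (64 - n)) | (*lo >> n);
        *hi >>= n;
    }
    ```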
  43. in src/int128_struct_impl.h:105 in 101cd3e290 outdated
    100+   secp256k1_i128_mul(r, a, d);
    101+   secp256k1_i128_dissip_mul(r, b, c);
    102+}
    103+
    104+/* Signed (arithmetic) right shift.
    105+ * Non-constant time in b.
    


    roconnor-blockstream commented at 8:54 pm on March 28, 2022:
    in n.

    roconnor-blockstream commented at 3:58 pm on May 27, 2022:
    Done.
  44. in src/int128_native_impl.h:76 in 101cd3e290 outdated
    68+   return *a == *b;
    69+}
    70+
    71+static SECP256K1_INLINE int secp256k1_i128_check_bit(const secp256k1_int128 *r, unsigned int n) {
    72+   VERIFY_CHECK(n < 128);
    73+   return (*r == (int128_t)1 << n);
    


    roconnor-blockstream commented at 9:09 pm on April 25, 2022:

    1 << 127 would cause signed integer overflow.

    This never happens, but the VERIFY_CHECK should be reduced to n < 127.


    roconnor-blockstream commented at 3:58 pm on May 27, 2022:
    Done.
  45. roconnor-blockstream force-pushed on May 27, 2022
  46. roconnor-blockstream referenced this in commit 16ce0343a2 on May 27, 2022
  47. roconnor-blockstream commented at 9:35 pm on June 7, 2022: contributor
    I’ve written formal correctness proofs of the functions in the src/int128_struct_impl.h file using VST. The most relevant part is the function specifications, which can be found at lines of the form Definition secp256k1_*_spec : ...
  48. real-or-random cross-referenced this on Jul 6, 2022 from issue config: Set preprocessor defaults for ECMULT_* config values by real-or-random
  49. in src/int128_native_impl.h:51 in 0a384cc046 outdated
    46+   VERIFY_CHECK(0 <= ab ? *r <= INT128_MAX - ab : INT128_MIN - ab <= *r);
    47+   *r += ab;
    48+}
    49+
    50+static SECP256K1_INLINE void secp256k1_i128_det(secp256k1_int128 *r, int64_t a, int64_t b, int64_t c, int64_t d) {
    51+   *r = (int128_t)a * d - (int128_t)b * c;
    


    real-or-random commented at 11:43 am on July 11, 2022:
    Would it make sense to check for overflow here?

    roconnor-blockstream commented at 7:11 pm on July 11, 2022:
    Done.
  50. in src/int128.h:51 in 0a384cc046 outdated
    46+
    47+static SECP256K1_INLINE void secp256k1_i128_from_i64(secp256k1_int128 *r, int64_t a);
    48+
    49+static SECP256K1_INLINE int secp256k1_i128_eq_var(const secp256k1_int128 *a, const secp256k1_int128 *b);
    50+
    51+static SECP256K1_INLINE int secp256k1_i128_check_bit(const secp256k1_int128 *r, unsigned int n);
    


    real-or-random commented at 11:51 am on July 11, 2022:
    static SECP256K1_INLINE int secp256k1_i128_check_pow2(const secp256k1_int128 *r, unsigned int n);
    

    I think this one deserves a different name. check_bit is very close to check_bits but the semantics are rather different.

    A comment could also be helpful here, and maybe also for secp256k1_u128_check_bits


    roconnor-blockstream commented at 7:11 pm on July 11, 2022:
    Done.
  51. in src/modinv64_impl.h:75 in 0a384cc046 outdated
    69@@ -60,6 +70,13 @@ static int secp256k1_modinv64_mul_cmp_62(const secp256k1_modinv64_signed62 *a, i
    70     }
    71     return 0;
    72 }
    73+
    74+/* Check if the determinant of t is equal to 1 << n. */
    75+static int secp256k1_modinv64_det_bit(const secp256k1_modinv64_trans2x2 *t, unsigned int n) {
    


    real-or-random commented at 1:17 pm on July 11, 2022:

    If you accept the other change, then this should probably be

    static int secp256k1_modinv64_det_check_pow2(const secp256k1_modinv64_trans2x2 *t, unsigned int n) {
    

    roconnor-blockstream commented at 7:11 pm on July 11, 2022:
    Done.
  52. in src/int128_struct_impl.h:113 in 0a384cc046 outdated
    108+   VERIFY_CHECK(n < 128);
    109+   if (n >= 64) {
    110+     r->lo = (uint64_t)((int64_t)(r->hi) >> (n-64));
    111+     r->hi = (uint64_t)((int64_t)(r->hi) >> 63);
    112+   } else if (n > 0) {
    113+     r->lo = ((r->hi & (((uint64_t)1<<n)-1)) << (64-n)) | r->lo >> n;
    


    real-or-random commented at 1:17 pm on July 11, 2022:
    (as above)

    roconnor-blockstream commented at 7:11 pm on July 11, 2022:
    Done.
  53. in src/scalar_4x64_impl.h:128 in 0a384cc046 outdated
    151+    secp256k1_u128_accum_u64(&t, ((uint64_t)((bit >> 6) == 3)) << (bit & 0x3F));
    152+    r->d[3] = secp256k1_u128_to_u64(&t);
    153 #ifdef VERIFY
    154-    VERIFY_CHECK((t >> 64) == 0);
    155+    secp256k1_u128_rshift(&t, 64);
    156+    VERIFY_CHECK(secp256k1_u128_to_u64(&t) == 0);
    


    real-or-random commented at 1:28 pm on July 11, 2022:
        VERIFY_CHECK(secp256k1_u128_check_bits(&t, 64));
    

    roconnor-blockstream commented at 5:35 pm on July 11, 2022:
    I’ll compromise with VERIFY_CHECK(secp256k1_u128_hi_u64(&t) == 0);.

    roconnor-blockstream commented at 7:12 pm on July 11, 2022:
    Done.
  54. in src/scalar_4x64_impl.h:582 in 0a384cc046 outdated
    585+    secp256k1_u128_accum_u64(&c128, p2);
    586+    secp256k1_u128_accum_u64(&c128, p4);
    587+    r->d[2] = secp256k1_u128_to_u64(&c128); secp256k1_u128_rshift(&c128, 64);
    588+    secp256k1_u128_accum_u64(&c128, p3);
    589+    r->d[3] = secp256k1_u128_to_u64(&c128); secp256k1_u128_rshift(&c128, 64);
    590+    c = secp256k1_u128_to_u64(&c128);
    


    real-or-random commented at 1:37 pm on July 11, 2022:

    I suggest

    0    r->d[3] = secp256k1_u128_to_u64(&c128);
    1    c = secp256k1_u128_hi_u64(&c128);
    

    or even inline secp256k1_u128_to_u64(...) into the last line of the function; then one c would suffice and you’d avoid introducing a c128.


    roconnor-blockstream commented at 5:33 pm on July 11, 2022:
    The last line of the function exists outside this #if block (notice that the type of c is originally different between the two blocks). So I cannot avoid making a c value here. I suppose I could duplicate the last line if you think that is worthwhile.

    real-or-random commented at 5:38 pm on July 11, 2022:

    Oh, I didn’t see that the ASM part uses the last line.

    I suppose I could duplicate the last line if you think that is worthwhile.

    I don’t think so.


    roconnor-blockstream commented at 7:11 pm on July 11, 2022:
    Done.
  55. in src/int128_struct_impl.h:94 in 0a384cc046 outdated
    80+
    81+static SECP256K1_INLINE void secp256k1_i128_accum_mul(secp256k1_int128 *r, int64_t a, int64_t b) {
    82+   uint64_t lo = (uint64_t)a * (uint64_t)b;
    83+   uint64_t hi = (uint64_t)secp256k1_mulh(a, b) + (~lo < r->lo);
    84+   VERIFY_CHECK((r->hi <= 0x7fffffffffffffffu && hi <= 0x7fffffffffffffffu) <= (r->hi + hi <= 0x7fffffffffffffffu));
    85+   VERIFY_CHECK((r->hi > 0x7fffffffffffffffu && hi > 0x7fffffffffffffffu) <= (r->hi + hi > 0x7fffffffffffffffu));
    


    real-or-random commented at 3:54 pm on July 11, 2022:

    It took me a while to parse and understand these. Maybe it’s a good idea to add some comments.

    Let me try to explain in my own words. Please tell me if I’m wrong.

    • r->hi <= 0x7fffffffffffffffu means “r is non-negative.”
    • X <= Y is a way of writing $X \implies Y$ if we know that $X \in \{0,1\}$. (I’ve never seen this. I had expected !X || Y, which works for all X. But <= is totally fine if it comes with a comment.)

    So the first line says: If r is non-negative and we’re adding a non-negative number, then the sum should be non-negative. (Which is equivalent to having no overflow.)

    The second line is the analogous thing that checks underflow.


    roconnor-blockstream commented at 7:12 pm on July 11, 2022:
    Done.
  56. roconnor-blockstream force-pushed on Jul 11, 2022
  57. real-or-random commented at 7:59 pm on July 11, 2022: contributor

    This is about 1% slower on my machine with gcc -O2 but about 1.5% faster with gcc -O3.

    gcc 12.1.0 -O2, x86_64, asm disabled:

     0Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
     1
     2master:
     3
     4ecdsa_verify                  ,    53.7       ,    53.9       ,    54.3    
     5ecdsa_sign                    ,    39.5       ,    39.7       ,    40.9    
     6
     7
     8pr:
     9
    10ecdsa_verify                  ,    54.2       ,    54.4       ,    55.2    
    11ecdsa_sign                    ,    39.7       ,    39.9       ,    40.6  
    

    gcc 12.1.0 -O3, x86_64, asm disabled:

     0
     1Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
     2
     3master:
     4
     5ecdsa_verify                  ,    52.2       ,    52.5       ,    53.2    
     6ecdsa_sign                    ,    39.6       ,    39.7       ,    40.3   
     7
     8pr:
     9
    10ecdsa_verify                  ,    51.6       ,    51.8       ,    52.4    
    11ecdsa_sign                    ,    39.5       ,    39.7       ,    40.4  
    

    And it seems that gcc version 12 is finally faster at -O3 than at -O2…

    Do we care about the loss at -O2? I guess no.

  58. roconnor-blockstream commented at 8:27 pm on July 11, 2022: contributor
    Is the generated asm different?
  59. real-or-random approved
  60. real-or-random commented at 10:24 pm on July 11, 2022: contributor
    ACK 7edf1a373236f3200dee77aae12d9993b9d12995 careful code inspection, a lot of tests with --with-test-override-wide-multiply=int128_struct (which is not covered by CI yet), I have not looked at the VST proofs in detail
  61. real-or-random commented at 10:27 pm on July 11, 2022: contributor

    Is the generated asm different?

    Yeah, we could check but I imagine it will be a lot of work. It would probably make sense to first run the internal benchmarks and see where we lose performance. And then look at the asm of these functions.

  62. in src/int128_impl.h:2 in 3560a1f09d outdated
    0@@ -0,0 +1,25 @@
    1+/***********************************************************************
    2+ * Copyright (c) 2014 Pieter Wuille                                    *
    


    sipa commented at 4:28 pm on July 21, 2022:
    I can’t recall writing this in 2014.

    roconnor-blockstream commented at 2:10 pm on July 25, 2022:
    Arguably this file is a derived work of your scalar_impl.h file. :) But I’m happy to remove it if you like.

    roconnor-blockstream commented at 3:21 pm on July 27, 2022:
    Done.

    real-or-random commented at 3:03 pm on July 29, 2022:

    Not sure if it’s better without a copyright header.

    Arguably this file is a derived work of your scalar_impl.h file. :) But I’m happy to remove it if you like.

    Then you may be violating the MIT license because it requires you to keep the copyright notice. :D

    For me personally, I don’t care. In the interest of getting this merged, I’ll ACK this with and without notice, and with any of your names. But if you ask me, we should really see if I can make copyright notices easier in the future.

  63. in src/int128.h:16 in 3560a1f09d outdated
    13+#include "int128_struct.h"
    14+#else
    15+#error "Please select int128 implementation"
    16+#endif
    17+
    18+static SECP256K1_INLINE void secp256k1_u128_mul(secp256k1_uint128 *r, uint64_t a, uint64_t b);
    


    sipa commented at 6:33 pm on July 21, 2022:
    Would it be possible to add a one-line comment about the semantics of each of these functions?

    roconnor-blockstream commented at 3:22 pm on July 27, 2022:
    Done.
  64. in src/int128_struct_impl.h:12 in 3560a1f09d outdated
     7+  #include <intrin.h>
     8+  #define secp256k1_umulh __umulh
     9+  #define secp256k1_mulh __mulh
    10+#else
    11+static SECP256K1_INLINE uint64_t secp256k1_umulh(uint64_t a, uint64_t b) {
    12+   uint64_t t1 = (uint64_t)(uint32_t)a * (uint32_t)b;
    


    sipa commented at 7:04 pm on July 21, 2022:
    Indentation is 4 spaces, both in functions and if/then/else blocks. (here and elsewhere in this PR)

    roconnor-blockstream commented at 3:21 pm on July 27, 2022:
    Merged your commit.
  65. in src/int128_struct_impl.h:34 in 3560a1f09d outdated
    29+   r->lo = a * b;
    30+}
    31+
    32+static SECP256K1_INLINE void secp256k1_u128_accum_mul(secp256k1_uint128 *r, uint64_t a, uint64_t b) {
    33+   uint64_t lo = a * b;
    34+   r->hi += secp256k1_umulh(a, b) + (~lo < r->lo);
    


    sipa commented at 7:09 pm on July 21, 2022:

    Interesting construction. The usual approach I’ve seen for this would be:

    0    uint64_t lo = a * b;
    1    r->lo += lo;
    2    r->hi += secp256k1_umulh(a, b) + (r->lo < lo);
    

    I don’t think we care about the performance of the struct approach here in most cases, but at least for MSVC this does become the most performant configuration. Any reason to think this code is preferable? Perhaps it’s worth looking at how MSVC compiles both.


    real-or-random commented at 9:27 am on July 22, 2022:
    Yes, yours seems better, see https://godbolt.org/z/d593Yxocq

    sipa commented at 1:02 pm on July 22, 2022:

    The generated code still uses two multiplication instructions it seems, which is very suboptimal (though obviously better than 4 32-bit multiplications).

    I think the right approach is using _mul128 and _umul128 on MSVC actually, which compute both the low and high half. I think that all code in this int128_struct implementation which needs mulh/umulh also needs the lower half, so perhaps it’s better to write things in function of those primitives?


    roconnor-blockstream commented at 1:33 pm on July 22, 2022:
    As I recall _mul128 and _umul128 are not available on 32-bit platforms. That said, it still makes sense to abstract through those functions.

    sipa commented at 1:44 pm on July 22, 2022:

    My understanding is:

    • x64 has all four: _mulh, _umulh, _mul128, and _umul128.
    • arm64 only has _mulh and _umulh.
    • x86 and arm32 have none

    The documentation is a bit inconsistent (one MSVC site says the _mulh are only available on x64, but another page lists them as available for arm64 too).

    Still, this matches what the hardware provides. x64 does have a 64x64->128 instruction, while arm64 only has separate low and high 64x64->64 multiplication instructions.


    sipa commented at 2:07 pm on July 22, 2022:
    Trying to make a patch that implements this.

    sipa commented at 3:08 pm on July 22, 2022:

    sipa commented at 6:11 pm on July 22, 2022:

    I made a few small changes to the branch above, but I’m not going to touch it anymore.

    Feel free to ignore, include, or squash into this PR (perhaps subject to translating the changes to the formal proof code?).


    roconnor-blockstream commented at 3:21 pm on July 27, 2022:
    Merged your commit.
  66. in src/int128_struct_impl.h:39 in 3560a1f09d outdated
    34+   r->hi += secp256k1_umulh(a, b) + (~lo < r->lo);
    35+   r->lo += lo;
    36+}
    37+
    38+static SECP256K1_INLINE void secp256k1_u128_accum_u64(secp256k1_uint128 *r, uint64_t a) {
    39+   r->hi += (r->lo > ~a);
    


    sipa commented at 7:12 pm on July 21, 2022:

    Likewise, here I’d use

    0r->lo += a;
    1r->hi += (r->lo < a);
    

    roconnor-blockstream commented at 3:22 pm on July 27, 2022:
    Merged your commit.
  67. in src/int128_struct_impl.h:49 in 3560a1f09d outdated
    44+ * Non-constant time in n.
    45+ */
    46+static SECP256K1_INLINE void secp256k1_u128_rshift(secp256k1_uint128 *r, unsigned int n) {
    47+   VERIFY_CHECK(n < 128);
    48+   if (n >= 64) {
    49+     r->lo = (r->hi) >> (n-64);
    


    sipa commented at 8:46 pm on July 21, 2022:
    Nit: braces around (r->hi) are unnecessary.

    roconnor-blockstream commented at 3:22 pm on July 27, 2022:
    Merged your commit.
  68. in src/int128_struct_impl.h:83 in 3560a1f09d outdated
    78+   r->lo = (uint64_t)a * (uint64_t)b;
    79+}
    80+
    81+static SECP256K1_INLINE void secp256k1_i128_accum_mul(secp256k1_int128 *r, int64_t a, int64_t b) {
    82+   uint64_t lo = (uint64_t)a * (uint64_t)b;
    83+   uint64_t hi = (uint64_t)secp256k1_mulh(a, b) + (~lo < r->lo);
    


    sipa commented at 8:51 pm on July 21, 2022:

    Also here:

    0uint64_t lo = (uint64_t)a * b;
    1r->lo += lo;
    2r->hi += (uint64_t)secp256k1_mulh(a, b) + (r->lo < lo);
    

    (needs changes to the VERIFY_CHECKs too, though)


    roconnor-blockstream commented at 2:18 pm on July 22, 2022:
    The VERIFY_CHECKS only operate on hi values, so I don’t think the checks will need changes.

    roconnor-blockstream commented at 3:22 pm on July 27, 2022:
    Merged your commit.
  69. roconnor-blockstream force-pushed on Jul 27, 2022
  70. real-or-random commented at 2:52 pm on July 29, 2022: contributor
    @roconnor-blockstream Did you update your VST proofs to the changes suggested by Pieter? (If no, are planning to do so?)
  71. in src/int128.h:23 in 176ea45704 outdated
    18+/* Multiply two unsigned 64-bit values a and b and add the result to r.
    19+ * The final result is taken moduluo 2^128.
    20+ */
    21+static SECP256K1_INLINE void secp256k1_u128_accum_mul(secp256k1_uint128 *r, uint64_t a, uint64_t b);
    22+
    23+/* Add a unsigned 64-bit value a to r.
    


    real-or-random commented at 2:52 pm on July 29, 2022:
    0/* Add an unsigned 64-bit value a to r.
    

    roconnor-blockstream commented at 1:50 pm on August 8, 2022:
    Done.
  72. in src/int128.h:50 in 176ea45704 outdated
    45+static SECP256K1_INLINE int secp256k1_u128_check_bits(const secp256k1_uint128 *r, unsigned int n);
    46+
    47+/* Multiply two signed 64-bit values a and b and write the result to r. */
    48+static SECP256K1_INLINE void secp256k1_i128_mul(secp256k1_int128 *r, int64_t a, int64_t b);
    49+
    50+/* Multiply two unsigned 64-bit values a and b and add the result to r.
    


    real-or-random commented at 2:53 pm on July 29, 2022:
    0/* Multiply two signed 64-bit values a and b and add the result to r.
    

    roconnor-blockstream commented at 1:50 pm on August 8, 2022:
    Done.
  73. in src/int128.h:51 in 176ea45704 outdated
    46+
    47+/* Multiply two signed 64-bit values a and b and write the result to r. */
    48+static SECP256K1_INLINE void secp256k1_i128_mul(secp256k1_int128 *r, int64_t a, int64_t b);
    49+
    50+/* Multiply two unsigned 64-bit values a and b and add the result to r.
    51+ * Any overflow or underflow is undefined behaviour.
    


    real-or-random commented at 2:54 pm on July 29, 2022:
    I think “any” means just in the final addition, not in the multiplication. (It’s pretty clear if you think about it but may be worth clarifying.)

    roconnor-blockstream commented at 1:50 pm on August 8, 2022:
    Done.
  74. in src/int128.h:63 in 176ea45704 outdated
    58+/* Signed (arithmetic) right shift.
    59+ * Non-constant time in b.
    60+ */
    61+static SECP256K1_INLINE void secp256k1_i128_rshift(secp256k1_int128 *r, unsigned int b);
    62+
    63+/* Return the low 64-bits of a 128-bit value intepreted as an signed 64-bit value. */
    


    real-or-random commented at 2:55 pm on July 29, 2022:
    0/* Return the low 64-bits of a 128-bit value interpreted as an signed 64-bit value. */
    

    roconnor-blockstream commented at 1:50 pm on August 8, 2022:
    Done.
  75. in src/int128.h:39 in 176ea45704 outdated
    34+static SECP256K1_INLINE uint64_t secp256k1_u128_to_u64(const secp256k1_uint128 *a);
    35+
    36+/* Return the high 64-bits of a 128-bit value as an unsigned 64-bit value. */
    37+static SECP256K1_INLINE uint64_t secp256k1_u128_hi_u64(const secp256k1_uint128 *a);
    38+
    39+/* Write a unsigned 64-bit value to r. */
    


    real-or-random commented at 2:56 pm on July 29, 2022:
    0/* Write an unsigned 64-bit value to r. */
    

    roconnor-blockstream commented at 1:50 pm on August 8, 2022:
    Done.
  76. real-or-random commented at 2:57 pm on July 29, 2022: contributor
    just some nits and typos in the new comments
  77. real-or-random cross-referenced this on Jul 29, 2022 from issue [CI test, dontmerge] PR #1000 with __(u)mul128() based abstractions in int128_struct by sipa
  78. roconnor-blockstream force-pushed on Aug 8, 2022
  79. roconnor-blockstream commented at 1:52 pm on August 8, 2022: contributor

    I have not updated the proofs, but I do plan to do so. I don’t suggest waiting for them, but if you prefer to wait that’s fine.

    I’m working on a second pass through the proofs anyways in an attempt to add more automatic tactics to help me, so it is slow going.

  80. roconnor-blockstream commented at 9:45 pm on August 29, 2022: contributor
    While I continue to refactor my proofs to automate the tedious bits, I have now got to the point where the revised code in the int128_struct file is now proved correct.
  81. roconnor-blockstream referenced this in commit 38498beeda on Oct 24, 2022
  82. roconnor-blockstream commented at 9:33 pm on October 24, 2022: contributor

    I’m done with my pass at refactoring my proof. There isn’t too much to say. The new proofs are less tedious than they were before. You are welcome to “browse” the proof at https://htmlpreview.github.io/?https://github.com/ElementsProject/simplicity/blob/027e46ca91dc7eecf5a05707798edf17e72612c3/alectryon/verif_int128_impl.v.html for whatever that’s worth. Again, the *_spec definitions are probably the most important bits as they define the pre- and post-conditions of the functions that form the statements of what I’ve proved.

    The most notable change is that the specification for secp256k1_i128_to_i64 was amended to remove the precondition that the input be a 64-bit signed integer value. Apparently libsecp256k1 on occasion will cast signed int128 numbers to signed int64 numbers, expecting to truncate the bits. That seems kinda dicey to me, but okay.

  83. real-or-random commented at 8:09 am on October 25, 2022: contributor

    Apparently libsecp256k1 on occasion will cast signed int128 numbers to signed int64 numbers, expecting to truncate the bits. That seems kinda dicey to me, but okay.

    Do you have an example where this is unexpected (according to your impression)?

  84. roconnor-blockstream commented at 2:46 pm on October 25, 2022: contributor

    Looking into it more closely, it seems the sketchy cases of casting large signed int128 numbers to int64 are always then immediately masked with & M62, so it isn’t so dicey in practice.

    To check on this I created a commit that splits out a secp256k1_i128_to_u64 function and adds a VERIFY_CHECK to the secp256k1_i128_to_i64 function. This passes the tests. It will fail the tests with only the VERIFY_CHECK added.

    I’m not proposing we add that commit. I only made it for validation purposes. Although we could add it if people really think that is a good idea, but that would be a separate PR.

  85. bitcoin-core deleted a comment on Nov 2, 2022
  86. in src/int128.h:24 in beafb5538b outdated
    19+ * The final result is taken moduluo 2^128.
    20+ */
    21+static SECP256K1_INLINE void secp256k1_u128_accum_mul(secp256k1_uint128 *r, uint64_t a, uint64_t b);
    22+
    23+/* Add an unsigned 64-bit value a to r.
    24+ * The final result is taken moduluo 2^128.
    


    jonasnick commented at 4:57 pm on November 3, 2022:
    nit: moduluo typo x2

    roconnor-blockstream commented at 10:01 pm on November 7, 2022:
    Fixed.
  87. jonasnick commented at 4:57 pm on November 3, 2022: contributor

    I created a commit (https://github.com/jonasnick/secp256k1/commits/20211026_int128-jn) that adds int128_struct tests to CI (including an MSVC test that uses native (u)mul128). Feel free to cherry-pick and squash.

    I also ran some benchmarks just to confirm that the changes to field_5x52_int128 and scalar_4x64 don’t unexpectedly affect performance (https://gist.github.com/jonasnick/aed0f881aabe9b9f5bd769fcf3efbef8)

  88. in src/int128_struct_impl.h:23 in beafb5538b outdated
    18+
    19+static SECP256K1_INLINE int64_t secp256k1_mul128(int64_t a, int64_t b, int64_t* hi) {
    20+    *hi = __mulh(a, b);
    21+    return a * b;
    22+}
    23+#    endif
    


    jonasnick commented at 2:46 pm on November 7, 2022:
    Did we test this code path on ARM64 MSVC? Should we remove this until this is tested in CI?

    roconnor-blockstream commented at 7:15 pm on November 7, 2022:
    I’m not sure how our testing infrastructure works.

    sipa commented at 4:45 pm on November 13, 2022:

    As a very approximate test, I’ve replaced the __umulh and __mulh code with GCC inline asm code matching my understanding of what these intrinsics compile to in MSVC, and compiled with --host=aarch64-linux-gnu --with-test-override-wide-multiply=int128_struct with GCC. The resulting binary does contain the expected umulh and smulh instructions, and tests work fine in qemu-aarch64.

     0--- a/src/int128_struct_impl.h
     1+++ b/src/int128_struct_impl.h
     2@@ -3,21 +3,27 @@
     3 
     4 #include "int128.h"
     5 
     6-#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64)) /* MSVC */
     7-#    include <intrin.h>
     8+#if 1 /* MSVC */
     9 #    if defined(_M_X64)
    10 /* On x84_64 MSVC, use native _(u)mul128 for 64x64->128 multiplications. */
    11 #        define secp256k1_umul128 _umul128
    12 #        define secp256k1_mul128 _mul128
    13 #    else
    14 /* On ARM64 MSVC, use __(u)mulh for the upper half of 64x64 multiplications. */
    15+#warning "Using asm umulh code"
    16 static SECP256K1_INLINE uint64_t secp256k1_umul128(uint64_t a, uint64_t b, uint64_t* hi) {
    17-    *hi = __umulh(a, b);
    18+    __asm__ __volatile__(
    19+       "umulh %0, %1, %2" :
    20+       "=r"(*hi) :
    21+       "r"(a), "r"(b));
    22     return a * b;
    23 }
    24 
    25 static SECP256K1_INLINE int64_t secp256k1_mul128(int64_t a, int64_t b, int64_t* hi) {
    26-    *hi = __mulh(a, b);
    27+    __asm__ __volatile__(
    28+       "smulh %0, %1, %2" :
    29+       "=r"(*hi) :
    30+       "r"(a), "r"(b));
    31     return a * b;
    32 }
    33 #    endif
    

    jonasnick commented at 12:08 pm on November 14, 2022:
    That’s a sufficient approximate test. Thanks @sipa. Resolving this thread.

    roconnor-blockstream commented at 3:11 pm on November 14, 2022:
    Would you like me to add a comment “Beware of bugs in the above code; We have only proved it correct, not tested it.”?
  89. in src/int128.h:45 in beafb5538b outdated
    40+static SECP256K1_INLINE void secp256k1_u128_from_u64(secp256k1_uint128 *r, uint64_t a);
    41+
    42+/* Tests if r is strictly less than to 2^n.
    43+ * n must be strictly less than 128.
    44+ */
    45+static SECP256K1_INLINE int secp256k1_u128_check_bits(const secp256k1_uint128 *r, unsigned int n);
    


    jonasnick commented at 2:49 pm on November 7, 2022:
    secp256k1_u128_check_bits, secp256k1_i128_dissip_mul, secp256k1_i128_det, secp256k1_i128_from_i64, secp256k1_i128_check_pow2 are not used when compiled without VERIFY. Should we put them behind #ifdef VERIFY?

    roconnor-blockstream commented at 5:52 pm on November 7, 2022:
    I don’t know.

    jonasnick commented at 8:53 pm on November 7, 2022:
    One reason to do this would be if not doing it would negatively impact test coverage metrics. However, I just tested this with gcov and it doesn’t care about these functions at all (presumably because they don’t exist in the compilation result).

    sipa commented at 3:33 pm on November 13, 2022:
    Yeah I think we can rely on the compiler not emitting any code for static functions that are unused, so this isn’t a big concern.

    jonasnick commented at 12:08 pm on November 14, 2022:
    It could be done to give a better indication for readers what this code is for, but not a big concern, I agree. Resolving this conversation.
  90. jonasnick commented at 2:59 pm on November 7, 2022: contributor

    I added some basic tests to my branch mostly to confirm that the code is working as expected. But I think also generally it would be better if the code was tested. The tests also exercise a branch that was not executed before (i128_rshift with b >= 64).

    This also opens up the possibility of removing the 32-bit field and scalar implementations should that ever be desired.

    I didn’t really investigate this but benchmarks on my (64-bit) machine indicate that this won’t be easy:

    0./configure --with-test-override-wide-multiply=int64 --with-asm=no
    1        ecdsa_verify                  103.0
    2./configure --with-test-override-wide-multiply=int128_struct --with-asm=no
    3        ecdsa_verify                  188.0
    
  91. Simulated int128 type. 2914bccbc0
  92. int128: Tidy #includes of int128.h and int128_impl.h
    After this commit, int128.h and int128_impl.h are included as follows:
     - .c files which use int128 include int128_impl.h (after util.h)
     - .h files which use int128 include int128.h (after util.h)
    
    This list is exhaustive. util.h needs to be included first because it sets
    up necessary #defines.
    dceaa1f579
  93. ci: add int128_struct tests a340d9500a
  94. roconnor-blockstream force-pushed on Nov 7, 2022
  95. roconnor-blockstream commented at 10:02 pm on November 7, 2022: contributor
    @jonasnick I’ve merged your tests into one commit.
  96. sipa commented at 11:51 pm on November 7, 2022: contributor

    Some benchmarking on ARM64 hardware (AWS a1.large, Linux, GCC 11.2, all with --enable-static --disable-shared):

    this PR (a340d9500a9c45e5c261174f48b3eb18b3b3647d):

    • aarch64 (64-bit):
      • --with-test-override-wide-multiply=int64 --with-asm=no:
        • ecdsa_verify , 149.0 , 149.0 , 149.0
        • ecdsa_sign , 98.2 , 98.2 , 98.4
      • --with-test-override-wide-multiply=int128 --with-asm=no:
        • ecdsa_verify , 175.0 , 175.0 , 176.0
        • ecdsa_sign , 101.0 , 101.0 , 102.0
      • --with-test-override-wide-multiply=int128_struct --with-asm=no:
        • ecdsa_verify , 294.0 , 295.0 , 295.0
        • ecdsa_sign , 154.0 , 154.0 , 154.0
    • armhf (32-bit, but running on the same 64-bit hardware):
      • --with-test-override-wide-multiply=int64 --with-asm=no:
        • ecdsa_verify , 301.0 , 301.0 , 301.0
        • ecdsa_sign , 172.0 , 172.0 , 172.0
      • --with-test-override-wide-multiply=int64 --with-asm=arm:
        • ecdsa_verify , 256.0 , 256.0 , 256.0
        • ecdsa_sign , 154.0 , 154.0 , 154.0
      • --with-test-override-wide-multiply=int128_struct --with-asm=no:
        • ecdsa_verify , 477.0 , 478.0 , 481.0
        • ecdsa_sign , 265.0 , 266.0 , 266.0

    PR’s base commit (694ce8fb2d1fd8a3d641d7c33705691d41a2a860):

    • aarch64 (64-bit):
      • --with-test-override-wide-multiply=int64 --with-asm=no:
        • ecdsa_verify , 149.0 , 149.0 , 149.0
        • ecdsa_sign , 98.2 , 98.3 , 98.3
      • --with-test-override-wide-multiply=int128 --with-asm=no:
        • ecdsa_verify , 175.0 , 175.0 , 176.0
        • ecdsa_sign , 102.0 , 102.0 , 102.0
    • armhf (32-bit):
      • --with-test-override-wide-multiply=int64 --with-asm=no:
        • ecdsa_verify , 300.0 , 301.0 , 301.0
        • ecdsa_sign , 172.0 , 172.0 , 172.0
      • --with-test-override-wide-multiply=int64 --with-asm=arm:
        • ecdsa_verify , 256.0 , 256.0 , 256.0
        • ecdsa_sign , 155.0 , 155.0 , 155.0

    So conclusions (for this platform):

    • This PR doesn’t make anything worse for existing wide-multiply configurations (int64, int128).
    • int64 is faster than int128, but within int64, 64-bit output is faster than 32-bit output.
    • 32-bit ARM asm code is still faster than pure C.
  97. sipa commented at 5:46 pm on November 13, 2022: contributor

    I’m not sure we should care about whether the MSVC 64-bit ARM logic is tested. We clearly can’t promise testing on every possible hardware/OS/system/compiler combination, so we need to make some choices. Perhaps this is worth including once we have a release: a list of systems we’re testing on, which is a subset of the systems we aim to work on (everything C89 + uint64_t, pretty much).

    It’s perhaps a bit strange to include code specifically to optimize performance on MSVC 64-bit ARM, without considering that platform tested. We could drop the optimization, but unless there is a concern that having code for optimization present would be misinterpreted as “supported”, we could also just leave it for now.

  98. in src/modinv64_impl.h:10 in a340d9500a
     6@@ -7,17 +7,25 @@
     7 #ifndef SECP256K1_MODINV64_IMPL_H
     8 #define SECP256K1_MODINV64_IMPL_H
     9 
    10+#include "int128.h"
    


    sipa commented at 1:34 am on November 14, 2022:

    There should be a #include "int128_impl.h" here. Without it, the int128 functions are invoked before being defined, preventing them from being inlined.

    If I add it, performance is identical on GCC 11.2 with -O2, before and after, with asm disabled.



    sipa commented at 3:27 pm on November 14, 2022:
    Some preliminary testing suggests that this can’t actually matter. Investigating further before I insist on addressing this.

    sipa commented at 3:45 pm on November 14, 2022:
    Closing, this does not actually appear to have any impact on the generated binary in GCC 9 or GCC 12.
  99. jonasnick commented at 12:09 pm on November 14, 2022: contributor

    @sipa

    We could drop the optimization, but unless there is a concern that having code for optimization present would be misinterpreted as “supported”, we could also just leave it for now.

    I agree, my remark was mostly about adding code paths that have never been run at all.

  100. roconnor-blockstream commented at 3:14 pm on November 14, 2022: contributor
    Presumably we can add configuration options so that our MSVC test infrastructure can at least exercise this code path on X86.
  101. sipa commented at 3:21 pm on November 14, 2022: contributor
    @roconnor-blockstream We could add a configure/config flag for forcing the use of _mulh/_umulh over _mul128/_umul128, and set it for one of our MSVC CI configs.
  102. sipa commented at 8:44 pm on November 14, 2022: contributor

    I wrote some randomized tests for the int128 functions: https://github.com/sipa/secp256k1/commits/202211_int128

    Not everything is covered, but the most tricky functions are.

  103. real-or-random commented at 10:45 pm on November 14, 2022: contributor

    @roconnor-blockstream We could add a configure/config flag for forcing the use of _mulh/_umulh over _mul128/_umul128, and set it for one of our MSVC CI configs.

    https://github.com/real-or-random/secp256k1/tree/202211-int128-mulh-override implemented here, ready to be cherry-picked

    edit: I’ve tested this on local MSVC on wine with

    0./configure --enable-dev-mode --host=x86_64-w64-mingw32 --with-test-override-wide-multiply=int128_struct CPPFLAGS="-DSECP256K1_MSVC_MULH_TEST_OVERRIDE" CC=/opt/msvc/bin/x64/cl CFLAGS="-nologo -diagnostics:caret" LDFLAGS="-XCClinker -nologo -XCClinker -diagnostics:caret" NM="/opt/msvc/bin/x64/dumpbin -symbols -headers" AR="/opt/msvc/bin/x64/lib"
    
  104. sipa commented at 3:03 pm on November 16, 2022: contributor

    I wrote some randomized tests for the int128 functions: https://github.com/sipa/secp256k1/commits/202211_int128

    Not everything is covered, but the most tricky functions are.

    Update: now all functions are covered.

  105. sipa commented at 7:14 pm on November 16, 2022: contributor

    ACK a340d9500a9c45e5c261174f48b3eb18b3b3647d

    I think we can deal with the proposed follow-ups in future PRs.

  106. jonasnick commented at 7:21 pm on November 16, 2022: contributor
    ACK a340d9500a9c45e5c261174f48b3eb18b3b3647d
  107. real-or-random merged this on Nov 16, 2022
  108. real-or-random closed this on Nov 16, 2022

  109. sipa cross-referenced this on Nov 16, 2022 from issue Followups to int128_struct arithmetic by sipa
  110. real-or-random referenced this in commit e40fd277b7 on Nov 18, 2022
  111. sipa commented at 2:47 pm on November 21, 2022: contributor

    Just for reference, I redid the aarch64 benchmarks from #1000 (comment) on more modern hardware (Apple M1, and Amazon’s Graviton 3). In both cases, int128 was fastest, followed by int64, and int128_struct last.

    The actual numbers from Amazon c7g instances (which are based on Graviton 3).

     0Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
     1
     2int128
     3ecdsa_verify                  ,    48.6       ,    48.7       ,    48.9    
     4ecdsa_sign                    ,    33.6       ,    33.6       ,    33.6    
     5
     6int64
     7ecdsa_verify                  ,    60.8       ,    60.8       ,    60.9    
     8ecdsa_sign                    ,    44.0       ,    44.0       ,    44.0    
     9
    10int128_struct
    11ecdsa_verify                  ,   105.0       ,   105.0       ,   105.0    
    12ecdsa_sign                    ,    58.2       ,    58.2       ,    58.2
    

    Numbers on Apple M1:

     Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)

     int128
     ecdsa_verify                  ,    31.4       ,    31.5       ,    32.0
     ecdsa_sign                    ,    23.2       ,    23.3       ,    23.5

     int64
     ecdsa_verify                  ,    54.6       ,    54.8       ,    55.5
     ecdsa_sign                    ,    34.0       ,    34.0       ,    34.2

     int128_struct
     ecdsa_verify                  ,    83.4       ,    83.9       ,    87.3
     ecdsa_sign                    ,    45.4       ,    45.5       ,    45.5
  112. sipa cross-referenced this on Nov 21, 2022 from issue Faster const-time modinv divsteps by peterdettman
  113. real-or-random cross-referenced this on Dec 1, 2022 from issue 64x64->64 muls are not constant-time with MSVC on 32bit x86 by real-or-random
  114. sipa referenced this in commit 9d47e7b71b on Dec 13, 2022
  115. dhruv referenced this in commit 55ffd47cc6 on Dec 14, 2022
  116. dhruv referenced this in commit 967c65b158 on Dec 14, 2022
  117. roconnor-blockstream referenced this in commit 2238463065 on Jan 10, 2023
  118. dhruv referenced this in commit 78b5ddf28b on Jan 11, 2023
  119. dhruv referenced this in commit 215394a1d5 on Jan 11, 2023
  120. sipa cross-referenced this on Jan 17, 2023 from issue build: Add CMake-based build system by hebasto
  121. div72 referenced this in commit 945b094575 on Mar 14, 2023
  122. str4d referenced this in commit 0df7b459f6 on Apr 21, 2023
  123. sipa cross-referenced this on May 8, 2023 from issue Use double-wide types for additions in scalar code by sipa
  124. real-or-random cross-referenced this on May 23, 2023 from issue Alternative to uint128 by davidben
  125. in src/int128_impl.h:16 in a340d9500a
    11+#  elif defined(SECP256K1_INT128_STRUCT)
    12+#    include "int128_struct_impl.h"
    13+#  else
    14+#    error "Please select int128 implementation"
    15+#  endif
    16+#endif
    


    roconnor-blockstream commented at 2:38 pm on May 25, 2023:
    Although this file should never be included when SECP256K1_WIDEMUL_INT128 is not defined, we should have an error message here just in case that situation arises; otherwise, confusing error messages result.

    real-or-random commented at 11:29 pm on June 20, 2023:
    This file is actually included unconditionally in secp256k1.c. (One could argue that this is a bad idea, but the nice thing is that it keeps the preprocessor logic in the int128 “module”.)
  126. vmta referenced this in commit e1120c94a1 on Jun 4, 2023
  127. vmta referenced this in commit 8f03457eed on Jul 1, 2023
  128. jonasnick cross-referenced this on Jul 17, 2023 from issue Upstream PRs 1147, 1149, 1000, 1155, 1156 by jonasnick

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-22 06:15 UTC
