[CI test, dontmerge] PR #1000 with __(u)mul128() based abstractions in int128

sipa commented at 6:25 pm on July 22, 2022: contributor

Simulated int128 type. 8cf46673a3

int128: Tidy #includes of int128.h and int128_impl.h

After this commit, int128.h and int128_impl.h are included as follows:
 - .c files which use int128 include int128_impl.h (after util.h)
 - .h files which use int128 include int128.h (after util.h)

This list is exhaustive. util.h needs to included first because it sets
up necessary #defines.

62554ab655

Switch to int128_struct implementation using (u)mul128 abstractions 2fe03cda23

Use int128_struct on 64-bit MSVC 412f359f1a

sipa force-pushed on Jul 22, 2022

in src/util.h:247 in 412f359f1a

243@@ -244,6 +244,9 @@ static SECP256K1_INLINE void secp256k1_int_cmov(int *r, const int *a, int flag)
244 #elif defined(UINT128_MAX) || defined(__SIZEOF_INT128__)
245 # define SECP256K1_WIDEMUL_INT128 1
246 # define SECP256K1_INT128_NATIVE 1
247+#elif defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64))

real-or-random commented at 5:25 pm on July 29, 2022:

I know this is just a CI test but I think we want something like this after #1000. Then I’d suggest this as a heuristic:

0#elif UINTPTR_MAX >= UINT64_MAX

sipa commented at 5:27 pm on July 29, 2022:

That doesn’t guarantee the existence of __mulh and __umulh?

real-or-random commented at 5:35 pm on July 29, 2022:

No, it doesn’t. But even without access to an efficient 64x64->128 mul, it’s still a good idea to use the int128_struct implementation on a 64bit machine (with the simulated 64x64->128 mul then) because then the 64-bit field/scalar code will be used.

(I admit I don’t know if we’ll run into such a machine/compiler. It may be a good idea to output a preprocessor warning in this case, or actually remove the simulated 64x64->128 mul code because our tests will never exercise it.)

sipa commented at 5:38 pm on July 29, 2022:

Hmm, perhaps we should benchmark the 10x26 field code against the 5x52 + int128_struct code. I’m not convinced about which one is faster.

real-or-random commented at 8:25 am on July 30, 2022:

It’s faster on 64 bit platforms because all the other operations except muls will profit:

 0Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz (x86_64), gcc 12.1.0 -O2
 1
 2Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
 3
 4./configure --with-test-override-wide-multiply=int64
 5
 6ecdsa_verify                  ,    95.3       ,    96.2       ,   102.0    
 7ecdsa_sign                    ,    64.0       ,    64.1       ,    64.7  
 8
 9./configure  --with-test-override-wide-multiply=int128_struct
10
11ecdsa_verify                  ,    55.9       ,    56.4       ,    57.4    
12ecdsa_sign                    ,    42.2       ,    42.4       ,    43.0    
13
14./configure 
15
16ecdsa_verify                  ,    54.9       ,    55.4       ,    56.1    
17ecdsa_sign                    ,    39.9       ,    40.1       ,    40.7

sipa commented at 12:38 pm on July 30, 2022:

Ok, that’s convincing.

roconnor-blockstream commented at 2:00 pm on August 8, 2022:

Want to benchmark on 32-bit systems?

sipa commented at 8:58 pm on August 16, 2022:

When compiling for i686 (but executing on a Ryzen 5950X system), the 32-bit code is over 2x faster than the int128_struct 64-bit code.

There is another problem with using int128_struct on 32-bit systems: it doesn’t result in constant-time code for me (GCC 10.3.0 on x86_64 cross-compiling for i686).

roconnor-blockstream commented at 9:02 pm on August 16, 2022:

Any idea the source of non-constant time?

sipa commented at 9:03 pm on August 16, 2022:

0  r->hi += hi + (r->lo < lo);

roconnor-blockstream commented at 10:22 pm on August 16, 2022:

We should probably change this right?

sipa commented at 1:42 pm on August 17, 2022:

Change it to what? There are instrinsics for some platforms to do addition-with-carry, which likely result in better performance here (as well as constant-time operation), but they’re obviously not standard C.

roconnor-blockstream commented at 1:49 pm on August 17, 2022:

I understood there were similar non-constant time issues in libsecp256k1 where comparisons were being transformed into branches, and that refactoring the code somehow fixed the problem.

sipa commented at 2:02 pm on August 17, 2022:

Maybe, but I’m not sure that’s needed.

All of this is empirical of course - it’s very much possible for vaguely reasonable compilers to convert code like this into either constant time or non-constant time code.

But based on what I’ve seen from GCC, this isn’t too suprising: i686 doesn’t have a (CPU) native 64-bit type; all (u)int64_t operations are effectively already translated to operations on pairs of 32-bit integers. And it seems that on several platforms, GCC compiles comparisons on tuple-represented integers to branches (this is also true on 64-bit platforms with __int128).

Code of this form already occurs in scalar_4x64_impl.h, but it’s not a problem there, because it’s only used on 64-bit platforms, where uint64_t isn’t represented as a pair. For the same reason, it does not seem to be a problem if int128_struct_impl.h is used on 64-bit systems.

real-or-random commented at 3:24 pm on August 19, 2022:

all (u)int64_t operations are effectively already translated to operations on pairs of 32-bit integers. And it seems that on several platforms, GCC compiles comparisons on tuple-represented integers to branches

That makes a lot of sense but recent GCC >=8 seems to get away without a branch, at least in this simple test… https://gcc.godbolt.org/z/nss58G3Ko

If we find a nice of rewriting these comparisons, fine. But I don’t see any way of fixing this, so I wouldn’t bother too much. The build is seriously misconfigured if you use int128_struct on a 32bit machine, and we don’t need to support this.

roconnor-blockstream commented at 10:07 pm on August 8, 2022: contributor

It may be a good idea to output a preprocessor warning in this case, or actually remove the simulated 64x64->128 mul code because our tests will never exercise it.)

We could add a #warning pragma saying “Beware of bugs in the above code; We have only proved it correct, not tested it.”

sipa commented at 8:03 pm on November 16, 2022: contributor

Superseded by #1156.

sipa closed this on Nov 16, 2022

[CI test, dontmerge] PR #1000 with __(u)mul128() based abstractions in int128_struct #1130