Improve constant-timeness on PowerPC

real-or-random commented at 10:38 pm on July 24, 2020: contributor

Attempt at resolving #771 .

This surprisingly seems to improve the situation at least for the compilers available on godbolt.

gmaxwell commented at 11:03 pm on July 24, 2020: contributor

Good news, bad news! That indeed fixes the 32-bit scalar code on Fedora 32 GCC 10.1.1 PPC64LE!

Bad news: I’m reasonably confident that that ternary operator is there because for i386 cmov was generally slower than branching until core2, and the ternary was needed to get branchless behaviour there. I don’t currently have a ready to go i386 testing setup– but I’ll set one up if you’ll work with me on this.

PPC64LE also is non-constant time on:

     sign = 2 * (u_last > 0) - 1;

I didn’t see a clear way to fix that.

(also I think this highlights a greater problem that we can’t trust statements like this to be constant time. I think we have a lot of them, they’re not currently an issue through luck of the optimizer).

I’m happy to test PPC64LE btw (btw: luke-jr, grubles, bluematt, and td-linux on IRC also have ppc64 systems– probably other people too, it’s probably easier to get developers to test on PPC64LE right now, though i386 is still a relevant embedded/old system target).

gmaxwell commented at 11:05 pm on July 24, 2020: contributor

Aside, the 64-bit scalar code is also full of the “pointless” expressions– on ppc64 w/ this compiler they don’t compile into something valgrind complains about. … and also I think they’re there for the same reason they are there in the 32-bit code.

(aside, wow travis takes an absurdly long time to run)

real-or-random commented at 10:56 am on July 25, 2020: contributor

Good news, bad news! That indeed fixes the 32-bit scalar code on Fedora 10 PPC64LE!

Hooray.

Bad news: I’m reasonably confident that that ternary operator is there because for i386 cmov was generally slower than branching until core2, and the ternary was needed to get branchless behaviour there.

Yeah, given that this syntax is literally redundant, I suspected it was there for a reason.

I had a closer look now. It seems to me that the only “bad” instruction is c2 += (c1 < th) ? 1 : 0; in muladd, and this is compiled to a branch only in the first of three subsequent muladds. It seems the compiler overdoes reordering here, and may get it wrong somehow as a a result of this.

Changing this single line to c2 += (c1 < th) or even c2 += (c1 >= th) ? 0 : 1; seems to do the trick. Of course all of this is somewhat fragile but I’m still somewhat confident that the latter won’t change a lot for i386. I think we should document then, why these conditional operators are there.

I don’t currently have a ready to go i386 testing setup– but I’ll set one up if you’ll work with me on this.

Cool, let’s do this.

PPC64LE also is non-constant time on:
0     sign = 2 * (u_last > 0) - 1;
I didn’t see a clear way to fix that.

Hm, yeah, that one seems harder. Have you tried a shift or a conditional operator (haha) or a conditional operator with a reversed condition (hahaha)?

(also I think this highlights a greater problem that we can’t trust statements like this to be constant time. I think we have a lot of them, they’re not currently an issue through luck of the optimizer).

Yeah but sadly that’s still state of the art for this stuff.

gmaxwell commented at 11:00 am on July 25, 2020: contributor

When we’ve more fully characterized the situation we can also take our pleadings to the compiler gods and maybe get some advice that will result in something more stable (or get an unintended ‘optimization’ turned off).

gmaxwell commented at 1:45 am on July 26, 2020: contributor

Great news, valgrind still passes with this on i386 debian 8 (gcc 4.9.2), I’ve tried a mixture of compiler options (Os,O2, and mtune i486/pentium3/pentium4). I’m going to go and also try on a more modern x86 OS too and some other systems.

Assuming it all goes good I think we should remove all the useless ternaries but we should also perhaps put in an effort to figure out why they were there: I thought they were there to prevent branches, but it may be that we just copied the behaviour from openssl and assumed it was preventing branches.

sipa commented at 1:48 am on July 26, 2020: contributor

I believe it very well may have just been copied from OpenSSL. I vaguely remember looking at code like this: https://github.com/openssl/openssl/blob/OpenSSL_0_9_7-stable/crypto/bn/bn_asm.c#L468.

gmaxwell commented at 2:05 am on July 26, 2020: contributor

FWIW w/ default current master emits a branch on x86 alpine linux (GCC 9.3.0) on the same ECDH sign line as on PPC64LE.

gmaxwell commented at 2:34 am on July 26, 2020: contributor

Okay, so for the “sign = 2 * (u_last > 0) - 1;” on both Fedora PPC64 LE GCC 10.1.1 and Alpine i386 GCC 9.3.0 which emit a branch there, the issue is avoided by using “sign = -(u_last <= 0)|1;”

There are a bunch of candidate formulations, what worries me is that there is nothing about the original code that is obviously wrong. I’m concerned that the codebase has many other cases exposed to randomly putting out branches whenever a code change makes the compiler think it would be expensive to emit a shift or multiply at that location. :-/ (e.g. I can’t see a lesson we can learn here about what not to do).

The alternative “sign = -(u_last < 0)|1;” is not an equivalent expression but also passes the tests and I think seems more natural. So assuming someone reviews the surrounding code and it’s okay to have 0 end up with a sign of 1 instead of -1, and that this doesn’t compile into anything awful on x86_64, I thinks that is probably what it should use.

We should probably figure out which of the alternatives are faster/cleaner and then just try out a complete suite of fixes (including getting rid of all the pointless ternaries, since we can’t find any place where they help and we’ve found a place where they hurt) on as many targets as we can.

real-or-random commented at 9:13 am on July 26, 2020: contributor

“sign = 2 * (u_last > 0) - 1;”

Oh, the better fix is #741, which just needs about one more ACK. I just wanted to point out that we reviewed the surrounding code already and I had not remembered earlier that it just removes the variable entirely. Sorry for wasting some of your time.

gmaxwell commented at 9:26 am on July 26, 2020: contributor

Oh, sounds fine to me! I’m don’t mind chasing some blind alleys so long as things improve.

real-or-random commented at 9:32 am on July 26, 2020: contributor

I believe it very well may have just been copied from OpenSSL. I vaguely remember looking at code like this: https://github.com/openssl/openssl/blob/OpenSSL_0_9_7-stable/crypto/bn/bn_asm.c#L468.

I’m sure it was copied from OpenSSL, the code says so: https://github.com/bitcoin-core/secp256k1/blob/master/src/scalar_8x32_impl.h#L263

The code was written in this 21-years old commit https://github.com/openssl/openssl/commit/fb81ac5e6be4041e59df9362cd63880b21e2cafc by @dot-asm. I’m sure I have no clue what I did 20 years ago but maybe we’re lucky. @dot-asm, do you remember the reason for the ?1:0 in the macros? I could believe that you just were not aware that C guarantees that the result of a comparison is 0 or 1, or that you were concerned that reviewers are not aware. Or do you think there was a deeper reason?

There are a bunch of candidate formulations, what worries me is that there is nothing about the original code that is obviously wrong. I’m concerned that the codebase has many other cases exposed to randomly putting out branches whenever a code change makes the compiler think it would be expensive to emit a shift or multiply at that location. :-/ (e.g. I can’t see a lesson we can learn here about what not to do).

Yeah, and I believe it’s getting worse over time with more advanced optimizations. Sometimes a big hammers like in #728 just does it but this would not acceptable here for performance reasons.

gmaxwell commented at 11:17 am on July 26, 2020: contributor

I’ve now tested a bunch of older compilers on x86 and actually find the newer ones seem to be generally more reliable– e.g. you don’t have to go far back with clang before its spewing branches everywhere– e.g. managing to turn scalar_cmov into a ton of branches. GCC 2.95 does too, but it looks like anything ECGS or later only rarely has issues (like we’re discussing in this PR). Unfortunately, I think the story is really just that the compiler behaviour isn’t stable. When we wrote a bunch of this library we were in an island of good behaviour by compilers, but it might have just been luck.

Once we’ve fixed what we can for more important platforms, I’ll setup a test to run it on a metric boatload of targets so we can get a better idea of what historical stability has been like. Unfortunately the intersection of interesting-to-target and can-run-valgrind isn’t perfect.

real-or-random commented at 11:33 am on July 26, 2020: contributor

Once we’ve fixed what we can for more important platforms, I’ll setup a test to run it on a metric boatload of targets so we can get a better idea of what historical stability has been like. Unfortunately the intersection of interesting-to-target and can-run-valgrind isn’t perfect.

Are we aware of further issues on a more important platform?

There’s #711, which is also a constant-time PR that needs review.

It would be neat to have set of compilers and targets, where we can say we consider it a real bug for -O2 and above and maybe -Os. One issue is that people may use their own compiler flags for optimizations, maybe we should at least emit a warning in the Makefile if CFLAGS is defined.

dot-asm commented at 12:12 pm on July 26, 2020: none

@dot-asm, do you remember the reason for the ?1:0 in the macros?

I don’t really feel that responsible for the ?1:0 per se, because it was itself copied from elsewhere. And as already implied, most likely reason for it is uncertainty in what a buggy compiler might do. At least it’s known that historically there were workarounds for compiler quirks/bugs and this just might be one of them. Arguably obsolete by now.

real-or-random commented at 12:45 pm on July 26, 2020: contributor

@dot-asm, do you remember the reason for the ?1:0 in the macros?

I don’t really feel that responsible for the ?1:0 per se, because it was itself copied from elsewhere. And as already implied, most likely reason for it is uncertainty in what a buggy compiler might do. At least it’s known that historically there were workarounds for compiler quirks/bugs and this just might be one of them. Arguably obsolete by now.

Ok, thanks for the quick reply! By the way, do you know where you got this code from? (I don’t think it’s relevant then, I’m just curious.)

I’ll update the PR then to remove the ternaries also in the other file.

real-or-random force-pushed on Jul 26, 2020

real-or-random marked this as ready for review on Jul 26, 2020

Remove redundant "? 1 : 0" after comparisons in scalar code

This prevents GCC from generating branches on PowerPC in certain
cases.

Fixes #771.

5b196338f0

real-or-random force-pushed on Jul 26, 2020

real-or-random commented at 1:00 pm on July 26, 2020: contributor

Okay, ready for review.

dot-asm commented at 2:19 pm on July 26, 2020: none

By the way, do you know where you got this code from?

Hmmm… You know, it might happen that I’m actually responsible for it after all… Note that it’s in BN_MULT_HIGH ~~option~~section and I do remember myself looking for a way to get rid of branch on Alpha where the comparison should map into single instruction… It’s not unlike that after cross-checking different compilers on different platforms (by modelling on non-Alpha obviously) a conclusion was drawn that explicit 1:0 is more likely to give desirable outcome across platforms. In other words it’s not necessarily about specific bug in specific compiler, it might happen it’s more about looking for specific outcome across multiple compilers… And now the circle seems to close, as this PR suggests that most like outcome is achieved if you don’t have 1:0…

On side note, I don’t see any difference in generated code on PPC… Out of curiosity, which compiler are you looking at? I’m looking at gcc 9.3…

gmaxwell commented at 2:28 pm on July 26, 2020: contributor

We see the ternary causing branches on GCC 10.1.1 (Fedora 32) on PPC64LE. … but not always, it does it in some cases and not others (in particular, it looks like its doing it when its switching on a 32-bit variable but not a 64-bit one?)

dot-asm commented at 2:40 pm on July 26, 2020: none

We see the ternary causing branches on GCC 10.1.1 … when switching on 32-bit variable

9.3 doesn’t and instead actually does something cool. It performs 64-bit subtraction of originally 32-bit values and then shift by 63. Oh well… I apologize for distraction. Cheers:-)

real-or-random commented at 2:44 pm on July 26, 2020: contributor

@dot-asm

Here’s a godbolt playground: https://godbolt.org/z/3fv385

Here GCC 9 (and other versions) generate branches, notice the jump labels in the output. If you remove the ? 1 : 0 in line 287, the branches disappear.

gmaxwell commented at 2:51 pm on July 26, 2020: contributor

@real-or-random Your update is clean on the PPC host in O2 for both 32 and 64 bit scalar, hurrah… so now that this passes, I attempt -Os on PPC64LE and I am filled with sadness because -Os does something to the binary that makes valgrind unusable (makes it think that all zero initialized stack constructed variables are uninitilized or something along those lines), presumably a valgrind bug.

I will figure it out later after testing this master + this PR on a bunch of configurations.

gmaxwell commented at 2:05 am on July 27, 2020: contributor

Preliminary report:

 0Tested with and without endomorphism, tested 32/64/asm scalar+group (as many as applicable)
 1
 2i386 Alpine Linux GCC 9.3 -O3
 3i386 Alpine Linux GCC 9.3 -O2
 4i386 Alpine Linux GCC 9.3 -Os
 5i386 Alpine Linux Clang 10.0 -O3
 6i386 Alpine Linux Clang 10.0 -O2
 7i386 Alpine Linux Clang 10.0 -Os
 8i386 Debian8 GCC 4.9.2 -O3
 9i386 Debian8 GCC 4.9.2 -O2
10i386 Debian8 GCC 4.9.2 -Os
11*FAIL* i386 Debian8 Clang 3.5.0-10 -O3 *FAIL* (branches everywhere)
12*FAIL* i386 Debian8 Clang 3.5.0-10 -O2 *FAIL* (branches everywhere)
13*FAIL* i386 Debian8 Clang 3.5.0-10 -Os *FAIL* (branches everywhere)
14*FAIL* i386 Debian8 GCC 2.95.3 -O2 *FAIL*  (I tested something this old because it's similar vintage to the ?1:0 code, and also still in a little use)
15*FAIL* i386 Debian8 GCC 2.95.3 -Os *FAIL*  (I tested something this old because it's similar vintage to the ?1:0 code, and also still in a little use)
16*FAIL* i386 Debian8 TCC 0.9.27 *FAIL* (similar behavior to GCC 2.95.3; but wow TCC compiles fast)
17arm7 Debian8 GCC 4.9.2 -O3
18arm7 Debian8 GCC 4.9.2 -O2
19arm7 Debian8 GCC 4.9.2 -Os
20*FAIL* arm7 Debian8 Clang 3.5.10-10 -O3 *FAIL* (like on i386; also ASM totally fails to compile)
21*FAIL* arm7 Debian8 Clang 3.5.10-10 -O2 *FAIL* (like on i386; also ASM totally fails to compile)
22*FAIL* arm7 Debian8 Clang 3.5.10-10 -Os *FAIL* (like on i386; also ASM totally fails to compile)
23PPC64LE Fedora 32 GCC 10.1.1 -O3
24PPC64LE Fedora 32 GCC 10.1.1 -O2
25*FAIL* PPC64LE Fedora 32 GCC 10.1.1 -Os *FAIL*  (looks like GCC -Os breaks valgrind completely, failure isn't CT specific)
26PPC64LE Fedora 32 Clang 10.0.0-2 -O3
27PPC64LE Fedora 32 Clang 10.0.0-2 -O2
28*FAIL* PPC64LE Fedora 32 Clang 10.0.0-2 -Os *FAIL* (This is due to a longstanding cross platform clang miscompliation bug[1], not a CT issue)
29AMD64 Fedora 32 GCC 10.1.1 -O3
30AMD64 Fedora 32 GCC 10.1.1 -O2
31AMD64 Fedora 32 GCC 10.1.1 -Os
32*FAIL* AMD64 Fedora 32 Clang 10.0.0 -Os *FAIL* (emits a branch at secp256k1_int_cmov (util.h:205))
33AMD64 Fedora 32 Clang 10.0.0 -O3
34AMD64 Fedora 32 Clang 10.0.0 -O2
35
36(I also tested the PR on BE sparc, and i386 haiku but can't valgrind in either or those places)
37
38This PR is a strict reduction in failures: everything that fails now failed before and some things that pass now failed before.
39
40[1] Clang sometimes replaces struct assignments with memcpys without considering the possibility that the assigned structs could overlap (e.g. by being an assign to self.).  Clang's authors argue that this is a valgrind FP because the compiler is allowed to use "undefined behavior", which is true but they are being foolish in this case:  Memcpy is implemented by libc, which they have no visibility or control over and libc can change out from under them to make UB like this unsafe.  (In fact, it has done so before-- it used to be safe with overlaps, now its is only safe for overlaps that are assign to self).
41
42TODO:
43 Must test: ARM8, More recent clang on ARM7. More recent GCC on ARM7. MIPS (common embedded platform and valgrind works there)
44 Fix ASM compile failure on clang w/ ARM7. (src/asm/field_10x26_arm.s:118:2: error: instruction requires: armv6t2, etc)
45 Figure out what is the earliest version of clang that even kinda works.
46 ICC, Maybe: Open64, Compcert
47 ARM compiler? (used to be extremely popular in embedded, not sure if it is anymore)
48 Is it possible to test MSVC using wine and animal sacrifice?
49 Fix clang 10 AMD64 (or at least report it to clang developers)

real-or-random commented at 9:38 am on July 27, 2020: contributor

Very nice.

TODO: Must test: ARM8, More recent clang on ARM7. More recent GCC on ARM7. MIPS (common embedded platform and valgrind works there)

That sounds reasonable. I think we could even merge this now but if you want, let’s see if it does not make things worse there either.

Fix ASM compile failure on clang w/ ARM7. (src/asm/field_10x26_arm.s:118:2: error: instruction requires: armv6t2, etc)

See #612 for context.

Fix clang 10 AMD64 (or at least report it to clang developers)

Hm I believe this is the same issue as in memczero. I’m just surprised that you can only see this for -Os, here it branches on O2/O3 too:

https://godbolt.org/z/57Ya89

I propose we should get rid of secp256k1_int_cmov and use memczero instead, that’s simpler. I suspect it costs half of a microsecond but it’s only used in pubkey recovery. :shrug: (Anyway, it’s hard to write obscure code to annoy the optimizer and at the same time expect performance. If we want performance, we should consider having asm implementations for all the cmov functions at least on x86_64… )

gmaxwell commented at 10:39 am on July 27, 2020: contributor

secp256k1_int_cmov doesn’t do anything the other cmovs don’t do– so if it’s trouble then why won’t the other ones (eventually) be trouble when the compiler gods are more angry? In the current use– it could be a memczero just fine, so I’d be fine with that change and if you want to add that to this PR that would be okay with me. Though I wouldn’t be too surprised if some future change needs an actual cmov.

If you wanted to get this merged right away that would be okay with me. I do think it would be best to at least get tests with current compilers on arm7 and arm8– just to guard against the unlikely case that it makes something important worse. I’ll try to do that in the next 24 hours.

There is probably going to need to be some document that lists what compilers the software has been tested on or something. I’m not sure. The valgrind test is easy and has so far shown to be pretty reliable, but we’re hamstrung by the fact that it can’t be run on embedded targets which are the most vulnerable to sidechannel attacks.

real-or-random commented at 12:04 pm on July 27, 2020: contributor

secp256k1_int_cmov doesn’t do anything the other cmovs don’t do– so if it’s trouble then why won’t the other ones (eventually) be trouble when the compiler gods are more angry? In the current use– it could be a memczero just fine, so I’d be fine with that change and if you want to add that to this PR that would be okay with me. Though I wouldn’t be too surprised if some future change needs an actual cmov.

I agree but I don’t have a much better idea currently as a quick fix. The only problematic compiler seems clang and clang seems easy to convince to use a real cmov by the ternary operator but then other compilers insert branches and I’d like to avoid compiler-conditional compilation at the moment. We may want to do this if clang starts “optimizing” the cmov functions in the arithmetic code…

If you wanted to get this merged right away that would be okay with me. I do think it would be best to at least get tests with current compilers on arm7 and arm8– just to guard against the unlikely case that it makes something important worse. I’ll try to do that in the next 24 hours.

Great!

There is probably going to need to be some document that lists what compilers the software has been tested on or something. I’m not sure. The valgrind test is easy and has so far shown to be pretty reliable, but we’re hamstrung by the fact that it can’t be run on embedded targets which are the most vulnerable to sidechannel attacks.

Yep. If I had a lot of time, I’d setup a fork of the compiler explorer in order to get a shitload of compilers and targets (it has docker images for the compilers), and then look automatically for branch instructions in a given list of functions. We could even make it look for variable-time instructions etc.

Suppress a harmless variable-time optimization by clang in _int_cmov

Follow up on 52a03512c1d800603b5c923c1a28bdba12dadb30

67a429f31f

real-or-random commented at 12:39 pm on July 27, 2020: contributor

secp256k1_int_cmov doesn’t do anything the other cmovs don’t do– so if it’s trouble then why won’t the other ones (eventually) be trouble when the compiler gods are more angry? In the current use– it could be a memczero just fine, so I’d be fine with that change and if you want to add that to this PR that would be okay with me. Though I wouldn’t be too surprised if some future change needs an actual cmov.

I agree but I don’t have a much better idea currently as a quick fix. The only problematic compiler seems clang and clang seems easy to convince to use a real cmov by the ternary operator but then other compilers insert branches and I’d like to avoid compiler-conditional compilation at the moment. We may want to do this if clang starts “optimizing” the cmov functions in the arithmetic code…

Pushed but I kept the function for now and applied the same “fix” as in memczero. I felt it’s somewhat inconsistent to remove it given that we have also _scalar_cmov which is also used only to overwrite with zeros if signing fails (and could be replaced memczero too but that’s another discussion then).

in src/util.h:203 in 67a429f31f

196@@ -197,10 +197,15 @@ static SECP256K1_INLINE void memczero(void *s, size_t len, int flag) {
197 /** If flag is true, set *r equal to *a; otherwise leave it. Constant-time.  Both *r and *a must be initialized and non-negative.*/
198 static SECP256K1_INLINE void secp256k1_int_cmov(int *r, const int *a, int flag) {
199     unsigned int mask0, mask1, r_masked, a_masked;
200+    /* Access flag with a volatile-qualified lvalue.
201+       This prevents clang from figuring out (after inlining) that flag can
202+       take only be 0 or 1, which leads to variable time code. */
203+    volatile int vflag = flag;

elichai commented at 1:10 pm on July 27, 2020:

you can make that volatile unsigned int and then remove the casting a few lines below?

real-or-random commented at 1:16 pm on July 27, 2020:

Hm, I don’t see why this is better. I prefer the current form because then it’s just copy/paste from above. (I know duplicated code is to be avoided but duplicated and modified code is even worse.)

gmaxwell commented at 5:03 pm on July 27, 2020: contributor

and then look automatically for branch instructions in a given list of functions.

Sadly that doesn’t work well because branches on non-secret inputs are fine. (or rather, it works fine to do it by hand, it didn’t seem to be something we could automate)

real-or-random commented at 5:15 pm on July 27, 2020: contributor

and then look automatically for branch instructions in a given list of functions.

Sadly that doesn’t work well because branches on non-secret inputs are fine. (or rather, it works fine to do it by hand, it didn’t seem to be something we could automate)

Yeah it would certainly not be trivial, also with declassifications etc. I don’t know. It would envision it rather as a thing where you audit the branches once and then it’s some kind of CI thing that will tell you if a new PR will introduce new branches. So it’s not perfect, but at least any changes in branching behavior get your attention (like with test coverage tools etc.)

gmaxwell commented at 1:31 am on July 28, 2020: contributor

I confirm the latest commit fixes

AMD64 Fedora 32 Clang 10.0.0 -Os.

And the following also pass:

arm7 Arch GCC 9.3.0 -O3 arm7 Arch GCC 9.3.0 -O2 arm7 Arch GCC 9.3.0 -Os arm7 Arch Clang 10.0.1 -O3 arm7 Arch Clang 10.0.1 -O2 arm7 Arch Clang 10.0.1 -Os

I just found an arm8 host, testing will be forthcoming.

Also: dear lord ecdsa_verify: min 3088us / avg 3089us / max 3090us (i.MX53 no bignum, no asm) perhaps I should have looked harder for a faster device to test on…

gmaxwell commented at 4:17 am on July 28, 2020: contributor

Okay, we can add the following as passes:

armv8 Ubuntu GCC 9.3.0 -O3 armv8 Ubuntu GCC 9.3.0 -O2 armv8 Ubuntu GCC 9.3.0 -Os armv8 Ubuntu Clang 10.0.0-4 -O3 armv8 Ubuntu Clang 10.0.0-4 -O2 armv8 Ubuntu Clang 10.0.0-4 -Os

As having no variable time bugs with this PR.

However, there are spurious failures, which I’ll open an issue for.

gmaxwell commented at 6:50 am on July 28, 2020: contributor

AMD64 ICC 19.1.2.254

 0==407113== Conditional jump or move depends on uninitialised value(s)
 1==407113==    at 0x485FB93: secp256k1_ec_pubkey_create (secp256k1.c:568)
 2==407113==    by 0x401490: main (valgrind_ctime_test.c:56)
 3==407113== 
 4==407113== Conditional jump or move depends on uninitialised value(s)
 5==407113==    at 0x48674AA: secp256k1_scalar_is_zero (secp256k1.c:521)
 6==407113==    by 0x48674AA: secp256k1_scalar_set_b32_seckey (scalar_impl.h:61)
 7==407113==    by 0x48674AA: secp256k1_scalar_cmov (secp256k1.c:496)
 8==407113==    by 0x48674AA: secp256k1_ecdsa_sign_inner (secp256k1.c:488)
 9==407113==    by 0x48670BE: secp256k1_ecdsa_sign_recoverable (main_impl.h:132)
10==407113==    by 0x401821: main (valgrind_ctime_test.c:81)

Your volatile tricks do not fool ICC.

real-or-random commented at 9:20 am on July 28, 2020: contributor

AMD64 ICC 19.1.2.254
0==407113==    by 0x48674AA: secp256k1_scalar_cmov (secp256k1.c:496)
Your volatile tricks do not fool ICC.

That’s because that function does not use this ugly hack yet. Sad. https://github.com/bitcoin-core/secp256k1/blob/5e1c885efb0f400d024efc259eddd6a5ee9cba7b/src/scalar_8x32_impl.h#L721-L725

secp256k1_int_cmov doesn’t do anything the other cmovs don’t do– so if it’s trouble then why won’t the other ones (eventually) be trouble when the compiler gods are more angry?

Ok, that escalated more quickly than I expected.

What do you think? Should we handle this in this PR, in another PR (that maybe looks into the cmovs in more detail) or not at all (because ICC)?

gmaxwell commented at 9:42 am on July 28, 2020: contributor

@real-or-random Your call! I ACK this PR as is, though I’d like to continue this effort in another PR if we go the route of merging this.

That’s because that function does not use this ugly hack yet. Sad.

Lol! hah. I called it and I didn’t even notice.

I hope you appreciate my motivation for testing test weird stuff is mostly because they act as proxies for things we can’t so easily test (e.g. future compilers; or how current compilers would behave after some future PRs move code around in the library) or expose errors that by luck cancel each other out on common developers systems (after all, we’ve already fixed most of the easy ones) .

ICC is a moderately important compiler in industry– I think it results in the fastest libsecp256k1 binary that I’ve tested so far (4% faster than GCC) – but I also think it’s a lot less critical than gcc/clang. We shouldn’t hold up improvements just waiting on more improvements. I’ve got a feeling we’re going to be chasing these things for a while.

I think I had a little batching preference before because its easier to test a bunch of changes at once when I’m testing across a bunch of systems. But I’ve already tested everything in this PR, as is, against a sufficient set of targets.

real-or-random cross-referenced this on Jul 28, 2020 from issue secp256k1_scalar_cmov not constant-time on ICC by real-or-random

real-or-random commented at 11:02 am on July 28, 2020: contributor

I think I had a little batching preference before because its easier to test a bunch of changes at once when I’m testing across a bunch of systems. But I’ve already tested everything in this PR, as is, against a sufficient set of targets.

I agree. Let’s get this merged then. By the way, if you write your ACKs in the form “ACK commit-id”, then they’ll be included in the merge commit by our GitHub merge script https://github.com/bitcoin-core/bitcoin-maintainer-tools/blob/master/github-merge.py.

real-or-random commented at 11:04 am on July 28, 2020: contributor

@elichai I’m trying to press the “re-request review” button but it does not work, so I’m mentioning you here instead. :)

gmaxwell commented at 11:23 am on July 28, 2020: contributor

I reviewed this and tested the heck out of it. It improves some systems and doesn’t appear to make any I tested worse. I also did limited testing for performance (I only tested gcc 10+x86_64+noasm) and this PR didn’t cause any measurable performance loss.

ACK 67a429f31fd3d1b37c5365cc58b70588b8645d62

elichai commented at 1:41 pm on July 28, 2020: contributor

locally with clang-10 on these lines: https://github.com/bitcoin-core/secp256k1/blob/40412b1930483fe5d2ff184d071d2b5125f84970/src/secp256k1.c#L519:L522

In master I get:

0        0010379e 48 85 c9        TEST       param_4,param_4
1        001037a1 74 07           JZ         LAB_001037aa
2        001037a6 f7 d8           NEG        EAX
3        001037a8 21 01           AND        dword ptr [param_4],EAX

ghidra disassembly:

0  if (local_230 != (uint *)0x0) {
1    *local_230 = *local_230 & -uVar6;
2  }

but with this PR I get:

0        0010379e 48 85 c9        TEST       param_4,param_4
1        001037a1 74 15           JZ         LAB_001037b8
2        001037b1 83 c0 ff        ADD        EAX,-0x1
3        001037b4 23 01           AND        EAX,dword ptr [param_4]
4        001037b6 89 01           MOV        dword ptr [param_4],EAX

ghidra disassembly:

0  if (local_230 != (uint *)0x0) {
1    *local_230 = uVar11 - 1 & *local_230;
2  }

both are constant time but weirdly different

real-or-random commented at 1:47 pm on July 28, 2020: contributor

@elichai Well ok but is this an issue?

And are you okay with my comment above on casting early? #772 (review)

elichai commented at 1:53 pm on July 28, 2020: contributor

Nope, no issues, just wanted to report my findings. I did a bunch of local testing with compilers and flags and nothing got worse on my arch + compilers so if it improves the situation on other targets(gmaxwell says it does) then that’s great :)

tACK 67a429f31fd3d1b37c5365cc58b70588b8645d62

real-or-random merged this on Jul 28, 2020

real-or-random closed this on Jul 28, 2020

dot-asm commented at 9:12 pm on August 2, 2020: none

It’s not exactly my business and it might be not as relevant, but in this time and age one can do better than being left at compiler’s mercy to interpret the condition in question as branch or not. I mean the fact that it does it one way today is no guarantee that it won’t do it otherwise tomorrow, with different flag or not. The “secret” keyword is __int128, the type supported by majority of contemporary 64-bit compilers, gcc, clang, icc, xlc (for Linux)…

gmaxwell commented at 2:12 am on August 3, 2020: contributor

The library uses it, on 64-bit. Most of the code we were discussing was the 32-bit code, while the 64-bit code is similar but uses larger limbs and works into a 128-bit int. Unfortunately it isn’t a silver bullet either because we still need carries there too, just far fewer of them.

[But thanks for the comment, if we hadn’t been aware of that it would be a major revelation and much appreciated!]

dot-asm commented at 8:26 am on August 3, 2020: none

What I tried to hint at is that one can use 128-bit additions to avoid handling carries with conditionals. For example muladd could look like this:

0uint128_t t = (uint128_t)a * b + c0;
1c0 = (uint64_t)t;
2t = (t>>64) + c1;
3c1 = (uint64_t)t;
4c2 += (uint64_t)(t>>64);

This can be naturally extended to 32-bit platforms by replacing ~~uint64_t~~ 64 with ~~uint32_t~~ 32 and uint128_t with uint64_t. And I kind of assumed that it was like that on 32-bit platforms…

sipa commented at 1:24 am on August 7, 2020: contributor

@dot-asm Of course… I don’t understand why I never considered that, as the exact same code was already using doubly-wide types for the multiplication…

See #786.

sipa cross-referenced this on Aug 7, 2020 from issue Use double-wide types for additions in scalar code by sipa

jasonbcox referenced this in commit 62fa9a6ee5 on Sep 27, 2020

deadalnix referenced this in commit 80b968b027 on Sep 28, 2020

Improve constant-timeness on PowerPC #772