Attempt at resolving #771 .
This surprisingly seems to improve the situation at least for the compilers available on godbolt.
Attempt at resolving #771 .
This surprisingly seems to improve the situation at least for the compilers available on godbolt.
Good news, bad news! That indeed fixes the 32-bit scalar code on Fedora 32 GCC 10.1.1 PPC64LE!
Bad news: I’m reasonably confident that that ternary operator is there because for i386 cmov was generally slower than branching until core2, and the ternary was needed to get branchless behaviour there. I don’t currently have a ready to go i386 testing setup– but I’ll set one up if you’ll work with me on this.
PPC64LE also is non-constant time on:
sign = 2 * (u_last > 0) - 1;
I didn’t see a clear way to fix that.
(also I think this highlights a greater problem that we can’t trust statements like this to be constant time. I think we have a lot of them, they’re not currently an issue through luck of the optimizer).
I’m happy to test PPC64LE btw (btw: luke-jr, grubles, bluematt, and td-linux on IRC also have ppc64 systems– probably other people too, it’s probably easier to get developers to test on PPC64LE right now, though i386 is still a relevant embedded/old system target).
Aside, the 64-bit scalar code is also full of the “pointless” expressions– on ppc64 w/ this compiler they don’t compile into something valgrind complains about. … and also I think they’re there for the same reason they are there in the 32-bit code.
(aside, wow travis takes an absurdly long time to run)
Good news, bad news! That indeed fixes the 32-bit scalar code on Fedora 10 PPC64LE!
Hooray.
Bad news: I’m reasonably confident that that ternary operator is there because for i386 cmov was generally slower than branching until core2, and the ternary was needed to get branchless behaviour there.
Yeah, given that this syntax is literally redundant, I suspected it was there for a reason.
I had a closer look now. It seems to me that the only “bad” instruction is c2 += (c1 < th) ? 1 : 0;
in muladd
, and this is compiled to a branch only in the first of three subsequent muladd
s. It seems the compiler overdoes reordering here, and may get it wrong somehow as a a result of this.
Changing this single line to c2 += (c1 < th)
or even c2 += (c1 >= th) ? 0 : 1;
seems to do the trick. Of course all of this is somewhat fragile but I’m still somewhat confident that the latter won’t change a lot for i386. I think we should document then, why these conditional operators are there.
I don’t currently have a ready to go i386 testing setup– but I’ll set one up if you’ll work with me on this.
Cool, let’s do this.
PPC64LE also is non-constant time on:
0 sign = 2 * (u_last > 0) - 1;
I didn’t see a clear way to fix that.
Hm, yeah, that one seems harder. Have you tried a shift or a conditional operator (haha) or a conditional operator with a reversed condition (hahaha)?
(also I think this highlights a greater problem that we can’t trust statements like this to be constant time. I think we have a lot of them, they’re not currently an issue through luck of the optimizer).
Yeah but sadly that’s still state of the art for this stuff.
Great news, valgrind still passes with this on i386 debian 8 (gcc 4.9.2), I’ve tried a mixture of compiler options (Os,O2, and mtune i486/pentium3/pentium4). I’m going to go and also try on a more modern x86 OS too and some other systems.
Assuming it all goes good I think we should remove all the useless ternaries but we should also perhaps put in an effort to figure out why they were there: I thought they were there to prevent branches, but it may be that we just copied the behaviour from openssl and assumed it was preventing branches.
Okay, so for the “sign = 2 * (u_last > 0) - 1;” on both Fedora PPC64 LE GCC 10.1.1 and Alpine i386 GCC 9.3.0 which emit a branch there, the issue is avoided by using “sign = -(u_last <= 0)|1;”
There are a bunch of candidate formulations, what worries me is that there is nothing about the original code that is obviously wrong. I’m concerned that the codebase has many other cases exposed to randomly putting out branches whenever a code change makes the compiler think it would be expensive to emit a shift or multiply at that location. :-/ (e.g. I can’t see a lesson we can learn here about what not to do).
The alternative “sign = -(u_last < 0)|1;” is not an equivalent expression but also passes the tests and I think seems more natural. So assuming someone reviews the surrounding code and it’s okay to have 0 end up with a sign of 1 instead of -1, and that this doesn’t compile into anything awful on x86_64, I thinks that is probably what it should use.
We should probably figure out which of the alternatives are faster/cleaner and then just try out a complete suite of fixes (including getting rid of all the pointless ternaries, since we can’t find any place where they help and we’ve found a place where they hurt) on as many targets as we can.
“sign = 2 * (u_last > 0) - 1;”
Oh, the better fix is #741, which just needs about one more ACK. I just wanted to point out that we reviewed the surrounding code already and I had not remembered earlier that it just removes the variable entirely. Sorry for wasting some of your time.
I believe it very well may have just been copied from OpenSSL. I vaguely remember looking at code like this: https://github.com/openssl/openssl/blob/OpenSSL_0_9_7-stable/crypto/bn/bn_asm.c#L468.
I’m sure it was copied from OpenSSL, the code says so: https://github.com/bitcoin-core/secp256k1/blob/master/src/scalar_8x32_impl.h#L263
The code was written in this 21-years old commit https://github.com/openssl/openssl/commit/fb81ac5e6be4041e59df9362cd63880b21e2cafc by @dot-asm. I’m sure I have no clue what I did 20 years ago but maybe we’re lucky. @dot-asm, do you remember the reason for the ?1:0
in the macros? I could believe that you just were not aware that C guarantees that the result of a comparison is 0 or 1, or that you were concerned that reviewers are not aware. Or do you think there was a deeper reason?
There are a bunch of candidate formulations, what worries me is that there is nothing about the original code that is obviously wrong. I’m concerned that the codebase has many other cases exposed to randomly putting out branches whenever a code change makes the compiler think it would be expensive to emit a shift or multiply at that location. :-/ (e.g. I can’t see a lesson we can learn here about what not to do).
Yeah, and I believe it’s getting worse over time with more advanced optimizations. Sometimes a big hammers like in #728 just does it but this would not acceptable here for performance reasons.
I’ve now tested a bunch of older compilers on x86 and actually find the newer ones seem to be generally more reliable– e.g. you don’t have to go far back with clang before its spewing branches everywhere– e.g. managing to turn scalar_cmov into a ton of branches. GCC 2.95 does too, but it looks like anything ECGS or later only rarely has issues (like we’re discussing in this PR). Unfortunately, I think the story is really just that the compiler behaviour isn’t stable. When we wrote a bunch of this library we were in an island of good behaviour by compilers, but it might have just been luck.
Once we’ve fixed what we can for more important platforms, I’ll setup a test to run it on a metric boatload of targets so we can get a better idea of what historical stability has been like. Unfortunately the intersection of interesting-to-target and can-run-valgrind isn’t perfect.
Once we’ve fixed what we can for more important platforms, I’ll setup a test to run it on a metric boatload of targets so we can get a better idea of what historical stability has been like. Unfortunately the intersection of interesting-to-target and can-run-valgrind isn’t perfect.
Are we aware of further issues on a more important platform?
There’s #711, which is also a constant-time PR that needs review.
It would be neat to have set of compilers and targets, where we can say we consider it a real bug for -O2 and above and maybe -Os. One issue is that people may use their own compiler flags for optimizations, maybe we should at least emit a warning in the Makefile if CFLAGS is defined.
@dot-asm, do you remember the reason for the ?1:0 in the macros?
I don’t really feel that responsible for the ?1:0 per se, because it was itself copied from elsewhere. And as already implied, most likely reason for it is uncertainty in what a buggy compiler might do. At least it’s known that historically there were workarounds for compiler quirks/bugs and this just might be one of them. Arguably obsolete by now.
@dot-asm, do you remember the reason for the ?1:0 in the macros?
I don’t really feel that responsible for the ?1:0 per se, because it was itself copied from elsewhere. And as already implied, most likely reason for it is uncertainty in what a buggy compiler might do. At least it’s known that historically there were workarounds for compiler quirks/bugs and this just might be one of them. Arguably obsolete by now.
Ok, thanks for the quick reply! By the way, do you know where you got this code from? (I don’t think it’s relevant then, I’m just curious.)
I’ll update the PR then to remove the ternaries also in the other file.
This prevents GCC from generating branches on PowerPC in certain
cases.
Fixes #771.
By the way, do you know where you got this code from?
Hmmm… You know, it might happen that I’m actually responsible for it after all… Note that it’s in BN_MULT_HIGH optionsection and I do remember myself looking for a way to get rid of branch on Alpha where the comparison should map into single instruction… It’s not unlike that after cross-checking different compilers on different platforms (by modelling on non-Alpha obviously) a conclusion was drawn that explicit 1:0 is more likely to give desirable outcome across platforms. In other words it’s not necessarily about specific bug in specific compiler, it might happen it’s more about looking for specific outcome across multiple compilers… And now the circle seems to close, as this PR suggests that most like outcome is achieved if you don’t have 1:0…
On side note, I don’t see any difference in generated code on PPC… Out of curiosity, which compiler are you looking at? I’m looking at gcc 9.3…
We see the ternary causing branches on GCC 10.1.1 … when switching on 32-bit variable
9.3 doesn’t and instead actually does something cool. It performs 64-bit subtraction of originally 32-bit values and then shift by 63. Oh well… I apologize for distraction. Cheers:-)
Here’s a godbolt playground: https://godbolt.org/z/3fv385
Here GCC 9 (and other versions) generate branches, notice the jump labels in the output. If you remove the ? 1 : 0
in line 287, the branches disappear.
@real-or-random Your update is clean on the PPC host in O2 for both 32 and 64 bit scalar, hurrah… so now that this passes, I attempt -Os on PPC64LE and I am filled with sadness because -Os does something to the binary that makes valgrind unusable (makes it think that all zero initialized stack constructed variables are uninitilized or something along those lines), presumably a valgrind bug.
I will figure it out later after testing this master + this PR on a bunch of configurations.
Preliminary report:
0Tested with and without endomorphism, tested 32/64/asm scalar+group (as many as applicable)
1
2i386 Alpine Linux GCC 9.3 -O3
3i386 Alpine Linux GCC 9.3 -O2
4i386 Alpine Linux GCC 9.3 -Os
5i386 Alpine Linux Clang 10.0 -O3
6i386 Alpine Linux Clang 10.0 -O2
7i386 Alpine Linux Clang 10.0 -Os
8i386 Debian8 GCC 4.9.2 -O3
9i386 Debian8 GCC 4.9.2 -O2
10i386 Debian8 GCC 4.9.2 -Os
11*FAIL* i386 Debian8 Clang 3.5.0-10 -O3 *FAIL* (branches everywhere)
12*FAIL* i386 Debian8 Clang 3.5.0-10 -O2 *FAIL* (branches everywhere)
13*FAIL* i386 Debian8 Clang 3.5.0-10 -Os *FAIL* (branches everywhere)
14*FAIL* i386 Debian8 GCC 2.95.3 -O2 *FAIL* (I tested something this old because it's similar vintage to the ?1:0 code, and also still in a little use)
15*FAIL* i386 Debian8 GCC 2.95.3 -Os *FAIL* (I tested something this old because it's similar vintage to the ?1:0 code, and also still in a little use)
16*FAIL* i386 Debian8 TCC 0.9.27 *FAIL* (similar behavior to GCC 2.95.3; but wow TCC compiles fast)
17arm7 Debian8 GCC 4.9.2 -O3
18arm7 Debian8 GCC 4.9.2 -O2
19arm7 Debian8 GCC 4.9.2 -Os
20*FAIL* arm7 Debian8 Clang 3.5.10-10 -O3 *FAIL* (like on i386; also ASM totally fails to compile)
21*FAIL* arm7 Debian8 Clang 3.5.10-10 -O2 *FAIL* (like on i386; also ASM totally fails to compile)
22*FAIL* arm7 Debian8 Clang 3.5.10-10 -Os *FAIL* (like on i386; also ASM totally fails to compile)
23PPC64LE Fedora 32 GCC 10.1.1 -O3
24PPC64LE Fedora 32 GCC 10.1.1 -O2
25*FAIL* PPC64LE Fedora 32 GCC 10.1.1 -Os *FAIL* (looks like GCC -Os breaks valgrind completely, failure isn't CT specific)
26PPC64LE Fedora 32 Clang 10.0.0-2 -O3
27PPC64LE Fedora 32 Clang 10.0.0-2 -O2
28*FAIL* PPC64LE Fedora 32 Clang 10.0.0-2 -Os *FAIL* (This is due to a longstanding cross platform clang miscompliation bug[1], not a CT issue)
29AMD64 Fedora 32 GCC 10.1.1 -O3
30AMD64 Fedora 32 GCC 10.1.1 -O2
31AMD64 Fedora 32 GCC 10.1.1 -Os
32*FAIL* AMD64 Fedora 32 Clang 10.0.0 -Os *FAIL* (emits a branch at secp256k1_int_cmov (util.h:205))
33AMD64 Fedora 32 Clang 10.0.0 -O3
34AMD64 Fedora 32 Clang 10.0.0 -O2
35
36(I also tested the PR on BE sparc, and i386 haiku but can't valgrind in either or those places)
37
38This PR is a strict reduction in failures: everything that fails now failed before and some things that pass now failed before.
39
40[1] Clang sometimes replaces struct assignments with memcpys without considering the possibility that the assigned structs could overlap (e.g. by being an assign to self.). Clang's authors argue that this is a valgrind FP because the compiler is allowed to use "undefined behavior", which is true but they are being foolish in this case: Memcpy is implemented by libc, which they have no visibility or control over and libc can change out from under them to make UB like this unsafe. (In fact, it has done so before-- it used to be safe with overlaps, now its is only safe for overlaps that are assign to self).
41
42TODO:
43 Must test: ARM8, More recent clang on ARM7. More recent GCC on ARM7. MIPS (common embedded platform and valgrind works there)
44 Fix ASM compile failure on clang w/ ARM7. (src/asm/field_10x26_arm.s:118:2: error: instruction requires: armv6t2, etc)
45 Figure out what is the earliest version of clang that even kinda works.
46 ICC, Maybe: Open64, Compcert
47 ARM compiler? (used to be extremely popular in embedded, not sure if it is anymore)
48 Is it possible to test MSVC using wine and animal sacrifice?
49 Fix clang 10 AMD64 (or at least report it to clang developers)
Very nice.
TODO: Must test: ARM8, More recent clang on ARM7. More recent GCC on ARM7. MIPS (common embedded platform and valgrind works there)
That sounds reasonable. I think we could even merge this now but if you want, let’s see if it does not make things worse there either.
Fix ASM compile failure on clang w/ ARM7. (src/asm/field_10x26_arm.s:118:2: error: instruction requires: armv6t2, etc)
See #612 for context.
Fix clang 10 AMD64 (or at least report it to clang developers)
Hm I believe this is the same issue as in memczero
. I’m just surprised that you can only see this for -Os, here it branches on O2/O3 too:
I propose we should get rid of secp256k1_int_cmov
and use memczero
instead, that’s simpler. I suspect it costs half of a microsecond but it’s only used in pubkey recovery. :shrug: (Anyway, it’s hard to write obscure code to annoy the optimizer and at the same time expect performance. If we want performance, we should consider having asm implementations for all the cmov functions at least on x86_64… )
secp256k1_int_cmov doesn’t do anything the other cmovs don’t do– so if it’s trouble then why won’t the other ones (eventually) be trouble when the compiler gods are more angry? In the current use– it could be a memczero just fine, so I’d be fine with that change and if you want to add that to this PR that would be okay with me. Though I wouldn’t be too surprised if some future change needs an actual cmov.
If you wanted to get this merged right away that would be okay with me. I do think it would be best to at least get tests with current compilers on arm7 and arm8– just to guard against the unlikely case that it makes something important worse. I’ll try to do that in the next 24 hours.
There is probably going to need to be some document that lists what compilers the software has been tested on or something. I’m not sure. The valgrind test is easy and has so far shown to be pretty reliable, but we’re hamstrung by the fact that it can’t be run on embedded targets which are the most vulnerable to sidechannel attacks.
secp256k1_int_cmov doesn’t do anything the other cmovs don’t do– so if it’s trouble then why won’t the other ones (eventually) be trouble when the compiler gods are more angry? In the current use– it could be a memczero just fine, so I’d be fine with that change and if you want to add that to this PR that would be okay with me. Though I wouldn’t be too surprised if some future change needs an actual cmov.
I agree but I don’t have a much better idea currently as a quick fix. The only problematic compiler seems clang and clang seems easy to convince to use a real cmov by the ternary operator but then other compilers insert branches and I’d like to avoid compiler-conditional compilation at the moment. We may want to do this if clang starts “optimizing” the cmov functions in the arithmetic code…
If you wanted to get this merged right away that would be okay with me. I do think it would be best to at least get tests with current compilers on arm7 and arm8– just to guard against the unlikely case that it makes something important worse. I’ll try to do that in the next 24 hours.
Great!
There is probably going to need to be some document that lists what compilers the software has been tested on or something. I’m not sure. The valgrind test is easy and has so far shown to be pretty reliable, but we’re hamstrung by the fact that it can’t be run on embedded targets which are the most vulnerable to sidechannel attacks.
Yep. If I had a lot of time, I’d setup a fork of the compiler explorer in order to get a shitload of compilers and targets (it has docker images for the compilers), and then look automatically for branch instructions in a given list of functions. We could even make it look for variable-time instructions etc.
Follow up on 52a03512c1d800603b5c923c1a28bdba12dadb30
secp256k1_int_cmov doesn’t do anything the other cmovs don’t do– so if it’s trouble then why won’t the other ones (eventually) be trouble when the compiler gods are more angry? In the current use– it could be a memczero just fine, so I’d be fine with that change and if you want to add that to this PR that would be okay with me. Though I wouldn’t be too surprised if some future change needs an actual cmov.
I agree but I don’t have a much better idea currently as a quick fix. The only problematic compiler seems clang and clang seems easy to convince to use a real cmov by the ternary operator but then other compilers insert branches and I’d like to avoid compiler-conditional compilation at the moment. We may want to do this if clang starts “optimizing” the cmov functions in the arithmetic code…
Pushed but I kept the function for now and applied the same “fix” as in memczero
. I felt it’s somewhat inconsistent to remove it given that we have also _scalar_cmov
which is also used only to overwrite with zeros if signing fails (and could be replaced memczero
too but that’s another discussion then).
196@@ -197,10 +197,15 @@ static SECP256K1_INLINE void memczero(void *s, size_t len, int flag) {
197 /** If flag is true, set *r equal to *a; otherwise leave it. Constant-time. Both *r and *a must be initialized and non-negative.*/
198 static SECP256K1_INLINE void secp256k1_int_cmov(int *r, const int *a, int flag) {
199 unsigned int mask0, mask1, r_masked, a_masked;
200+ /* Access flag with a volatile-qualified lvalue.
201+ This prevents clang from figuring out (after inlining) that flag can
202+ take only be 0 or 1, which leads to variable time code. */
203+ volatile int vflag = flag;
volatile unsigned int
and then remove the casting a few lines below?
and then look automatically for branch instructions in a given list of functions.
Sadly that doesn’t work well because branches on non-secret inputs are fine. (or rather, it works fine to do it by hand, it didn’t seem to be something we could automate)
and then look automatically for branch instructions in a given list of functions.
Sadly that doesn’t work well because branches on non-secret inputs are fine. (or rather, it works fine to do it by hand, it didn’t seem to be something we could automate)
Yeah it would certainly not be trivial, also with declassifications etc. I don’t know. It would envision it rather as a thing where you audit the branches once and then it’s some kind of CI thing that will tell you if a new PR will introduce new branches. So it’s not perfect, but at least any changes in branching behavior get your attention (like with test coverage tools etc.)
I confirm the latest commit fixes
AMD64 Fedora 32 Clang 10.0.0 -Os.
And the following also pass:
arm7 Arch GCC 9.3.0 -O3 arm7 Arch GCC 9.3.0 -O2 arm7 Arch GCC 9.3.0 -Os arm7 Arch Clang 10.0.1 -O3 arm7 Arch Clang 10.0.1 -O2 arm7 Arch Clang 10.0.1 -Os
I just found an arm8 host, testing will be forthcoming.
Also: dear lord ecdsa_verify: min 3088us / avg 3089us / max 3090us (i.MX53 no bignum, no asm) perhaps I should have looked harder for a faster device to test on…
Okay, we can add the following as passes:
armv8 Ubuntu GCC 9.3.0 -O3 armv8 Ubuntu GCC 9.3.0 -O2 armv8 Ubuntu GCC 9.3.0 -Os armv8 Ubuntu Clang 10.0.0-4 -O3 armv8 Ubuntu Clang 10.0.0-4 -O2 armv8 Ubuntu Clang 10.0.0-4 -Os
As having no variable time bugs with this PR.
However, there are spurious failures, which I’ll open an issue for.
AMD64 ICC 19.1.2.254
0==407113== Conditional jump or move depends on uninitialised value(s)
1==407113== at 0x485FB93: secp256k1_ec_pubkey_create (secp256k1.c:568)
2==407113== by 0x401490: main (valgrind_ctime_test.c:56)
3==407113==
4==407113== Conditional jump or move depends on uninitialised value(s)
5==407113== at 0x48674AA: secp256k1_scalar_is_zero (secp256k1.c:521)
6==407113== by 0x48674AA: secp256k1_scalar_set_b32_seckey (scalar_impl.h:61)
7==407113== by 0x48674AA: secp256k1_scalar_cmov (secp256k1.c:496)
8==407113== by 0x48674AA: secp256k1_ecdsa_sign_inner (secp256k1.c:488)
9==407113== by 0x48670BE: secp256k1_ecdsa_sign_recoverable (main_impl.h:132)
10==407113== by 0x401821: main (valgrind_ctime_test.c:81)
Your volatile tricks do not fool ICC.
AMD64 ICC 19.1.2.254
0==407113== by 0x48674AA: secp256k1_scalar_cmov (secp256k1.c:496)
Your volatile tricks do not fool ICC.
That’s because that function does not use this ugly hack yet. Sad. https://github.com/bitcoin-core/secp256k1/blob/5e1c885efb0f400d024efc259eddd6a5ee9cba7b/src/scalar_8x32_impl.h#L721-L725
secp256k1_int_cmov doesn’t do anything the other cmovs don’t do– so if it’s trouble then why won’t the other ones (eventually) be trouble when the compiler gods are more angry?
Ok, that escalated more quickly than I expected.
What do you think? Should we handle this in this PR, in another PR (that maybe looks into the cmovs in more detail) or not at all (because ICC)?
@real-or-random Your call! I ACK this PR as is, though I’d like to continue this effort in another PR if we go the route of merging this.
That’s because that function does not use this ugly hack yet. Sad.
Lol! hah. I called it and I didn’t even notice.
I hope you appreciate my motivation for testing test weird stuff is mostly because they act as proxies for things we can’t so easily test (e.g. future compilers; or how current compilers would behave after some future PRs move code around in the library) or expose errors that by luck cancel each other out on common developers systems (after all, we’ve already fixed most of the easy ones) .
ICC is a moderately important compiler in industry– I think it results in the fastest libsecp256k1 binary that I’ve tested so far (4% faster than GCC) – but I also think it’s a lot less critical than gcc/clang. We shouldn’t hold up improvements just waiting on more improvements. I’ve got a feeling we’re going to be chasing these things for a while.
I think I had a little batching preference before because its easier to test a bunch of changes at once when I’m testing across a bunch of systems. But I’ve already tested everything in this PR, as is, against a sufficient set of targets.
I think I had a little batching preference before because its easier to test a bunch of changes at once when I’m testing across a bunch of systems. But I’ve already tested everything in this PR, as is, against a sufficient set of targets.
I agree. Let’s get this merged then. By the way, if you write your ACKs in the form “ACK commit-id”, then they’ll be included in the merge commit by our GitHub merge script https://github.com/bitcoin-core/bitcoin-maintainer-tools/blob/master/github-merge.py.
I reviewed this and tested the heck out of it. It improves some systems and doesn’t appear to make any I tested worse. I also did limited testing for performance (I only tested gcc 10+x86_64+noasm) and this PR didn’t cause any measurable performance loss.
ACK 67a429f31fd3d1b37c5365cc58b70588b8645d62
locally with clang-10 on these lines: https://github.com/bitcoin-core/secp256k1/blob/40412b1930483fe5d2ff184d071d2b5125f84970/src/secp256k1.c#L519:L522
In master I get:
0 0010379e 48 85 c9 TEST param_4,param_4
1 001037a1 74 07 JZ LAB_001037aa
2 001037a6 f7 d8 NEG EAX
3 001037a8 21 01 AND dword ptr [param_4],EAX
ghidra disassembly:
0 if (local_230 != (uint *)0x0) {
1 *local_230 = *local_230 & -uVar6;
2 }
but with this PR I get:
0 0010379e 48 85 c9 TEST param_4,param_4
1 001037a1 74 15 JZ LAB_001037b8
2 001037b1 83 c0 ff ADD EAX,-0x1
3 001037b4 23 01 AND EAX,dword ptr [param_4]
4 001037b6 89 01 MOV dword ptr [param_4],EAX
ghidra disassembly:
0 if (local_230 != (uint *)0x0) {
1 *local_230 = uVar11 - 1 & *local_230;
2 }
both are constant time but weirdly different
@elichai Well ok but is this an issue?
And are you okay with my comment above on casting early? #772 (review)
Nope, no issues, just wanted to report my findings. I did a bunch of local testing with compilers and flags and nothing got worse on my arch + compilers so if it improves the situation on other targets(gmaxwell says it does) then that’s great :)
tACK 67a429f31fd3d1b37c5365cc58b70588b8645d62
__int128
, the type supported by majority of contemporary 64-bit compilers, gcc, clang, icc, xlc (for Linux)…
The library uses it, on 64-bit. Most of the code we were discussing was the 32-bit code, while the 64-bit code is similar but uses larger limbs and works into a 128-bit int. Unfortunately it isn’t a silver bullet either because we still need carries there too, just far fewer of them.
[But thanks for the comment, if we hadn’t been aware of that it would be a major revelation and much appreciated!]
What I tried to hint at is that one can use 128-bit additions to avoid handling carries with conditionals. For example muladd could look like this:
0uint128_t t = (uint128_t)a * b + c0;
1c0 = (uint64_t)t;
2t = (t>>64) + c1;
3c1 = (uint64_t)t;
4c2 += (uint64_t)(t>>64);
This can be naturally extended to 32-bit platforms by replacing uint64_t 64 with uint32_t 32 and uint128_t with uint64_t. And I kind of assumed that it was like that on 32-bit platforms…