Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86

real-or-random commented at 2:32 pm on January 11, 2020: contributor

The issue is that MSVC for 32-bit targets implements 64x64->64 bit multiplications using a non-constant subroutine. The subroutine is not constant-time because it shortcuts when the high 32 bits of both multiplicands are all 0.

See https://research.kudelskisecurity.com/2017/01/16/when-constant-time-source-may-not-save-you/ and also https://www.bearssl.org/ctmul.html for a broader view of the issue.

By inspection of our 8x32 scalar and 10x26 field code, I found four places in the field code where the high bits are not guaranteed to be zero.

This PR inserts VERIFY_CHECKS in the 8x32 scalar code to ensure the high bits are indeed 0. There, all ->64 multiplications are in fact 32x32->64.

Moreover, this PR modifies the four multiplications in the 10x26 such that the right multiplicand, which is a constant, has never all high bits set to zero. This is ensured by shifting that constant to the left. The costs are two additional shift instructions for shifting the product back to right, for each field element multiplication and doubling. The correctness follows from the VERIFY_BITS statements for the other multiplicands preceeding the multiplications, which ensure that we have enough unused high bits such that the multiplication won’t overflow even with the left-shifted constant.

I feel that this is the most reasonable thing we can do without too much effort and loss of performance.

Possible alternatives:

Do the same but with MSVC conditional compilation: I think that’s also okay but conditional compilation makes makes testing harder, in particular because noone here uses Windows.
Write assembly for the multiplication. That may also improve performance for MSVC because their routine is not only variable time but also slow. But this is more work, harder to review and I don’t care about performance for MSVC. Moreover, they have a different asm syntax…
Blacklist the 32-bit for MSVC 32-bit. The drawback is that this leaves us with no option there, and people will probably just comment out the blacklisting. And MSVC is indeed used, for example it’s a travis-tested target for rust-secp256k1.
Do nothing (and blame the compiler). But I don’t think that’s clever given that this PR is simple. Of course there are many other non-constant multiplication issues for different platform but I don’t think that’s a reason to ignore this one.

In general, what do people think about pointing out in the README that the library is supposed to be portable but tested tested mostly on gcc and clang, and it’s therefore recommend to compile it there if possible? This sounds a little bit like “Optimized for Netspace Navigator” but please read https://blog.mozilla.org/nfroyd/2018/05/29/when-implementation-monoculture-right-thing/ for why Mozilla dropped MSVC and how nice clang-cl is as a drop-in replacement for MSVC.

This is WIP because I need to add detailed comments, and I first want to see what people think.

edit: And I had a godbolt environment to play around with this but I lost it. I’ll make a new one if people are interested.

gmaxwell commented at 3:52 pm on January 11, 2020: contributor

It might be better to use verifybits macros in the scalar code, instead of an unstructured verifycheck.

With a fair amount effort previously, I extracted our 32bit field multiply instructions and verified it to be free of overflow and (I think) the verifybits statements in frama-c to be free of overflow. If this used the same macros a future redo would pick up the static analysis for free. I got discouraged before because frama-c didn’t support __int128, but presumably it (or another tool) would eventually and it might be reasonable to automated running range analysis on the C-language multiplier functions.

My main complaint with making it not conditional is that your fix isn’t free. (At least I assume it isn’t from a glance, haven’t tested, if it’s not actually a benchmarkable difference on GCC ignore the rest of this comment).

It since it’s only a few lines it wouldn’t be too messy to make conditional. As far as testing goes, this is something we should be able to have CI test. MSVC is a common enough compiler that it’s probably worth doing that regardless.

I wish I saw a way to detect any more of these being added in the codebase. I could detect 64x64->64 on secret inputs in modified valgrind… but couldn’t automatically exclude ones where the high words will not be zero. It’s nice to document that these are the only 64-bit output multiplies that aren’t really 32x32->64… as that fact might be really useful for anyone trying to make a SIMD version of these functions.

real-or-random commented at 1:16 pm on February 12, 2020: contributor

My main complaint with making it not conditional is that your fix isn’t free. (At least I assume it isn’t from a glance, haven’t tested, if it’s not actually a benchmarkable difference on GCC ignore the rest of this comment).

I had verified this statement in godbolt:

The costs are two additional shift instructions for shifting the product back to right, for each field element multiplication and doubling.

I haven’t run benchmarks yet but I doubt that two shifts are a benchmarkable difference.

real-or-random force-pushed on Mar 24, 2020

real-or-random marked this as ready for review on Mar 24, 2020

real-or-random renamed this:
~~WIP Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86~~
Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86
on Mar 24, 2020

sipa commented at 5:03 pm on March 24, 2020: contributor

My preference is doing this with conditional compilation for MSVC. It’s probably a trivial performance hit anyway, but it’s also an unfortunate complication for someone who doesn’t care about MSVC trying to audit the code.

You’re right that this complicates testing, but I think that’s simply a side effect of MSVC being undertested as a platform. That is already a problem, and this is exposing it. Bitcoin Core runs the secp256k1 tests on MSVC via AppVeyor; I’m sure people would be willing to help set up an AppVeyor instance for the secp256k1 repo too.

real-or-random commented at 5:08 pm on March 24, 2020: contributor

This is ready for review.

Here’s a compiler inspector instance to play around with this: https://godbolt.org/z/58L7rR

You can recreate this by pasting the following files, comment out the now unresolved #includes and remove SECP256K1_INLINE and static for all functions you’re interested in.

incude/secp256k1.h
src/util.h
src/field_10x26.h
src/field_10x26_impl.h
src/scalar_8x32.h
src/scalar_8x32_impl.h

real-or-random commented at 5:12 pm on March 24, 2020: contributor

My preference is doing this with conditional compilation for MSVC. It’s probably a trivial performance hit anyway, but it’s also an unfortunate complication for someone who doesn’t care about MSVC trying to audit the code.

Hm, I can make it conditional but I’m not super convinced that this is better. Making it conditional increases the burden for reviewers who care about multiple platforms. I think my comment is detailed enough to review audit these 4 multiplications quickly, and this is probably < 1 % of the work you’re doing if you’re auditing this code.

elichai cross-referenced this on Mar 26, 2020 from issue Constant-time behaviour test using valgrind memtest. by gmaxwell

real-or-random commented at 11:45 am on April 2, 2020: contributor

@jonasnick figured out that my calculations are off: This only ensures that the upper 48 bits are non-zero (instead of the upper 32 bits). Then approach doesn’t work actually, oops. Marking this as WIP for now, I need to think about this.

real-or-random renamed this:
~~Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86~~
WIP: Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86
on Apr 2, 2020

real-or-random marked this as a draft on Apr 22, 2020

real-or-random force-pushed on Apr 22, 2020

real-or-random renamed this:
~~WIP: Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86~~
Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86
on Apr 22, 2020

real-or-random marked this as ready for review on Apr 22, 2020

real-or-random commented at 3:18 pm on April 22, 2020: contributor

Okay, I updated this to use a different approach.

Also I made this conditional on MSVC. I’m still not convinced that this is better but I seem to be the only one who argues for making it unconditional.

in src/field_10x26_impl.h:457 in edc2d92019 outdated

453@@ -454,18 +454,15 @@ void secp256k1_fe_sqr_inner(uint32_t *r, const uint32_t *a);
454 
455 #else
456 
457-#ifdef VERIFY

sipa commented at 8:35 pm on April 22, 2020:

Also remove these from src/field_5x52_int128_impl.h?

real-or-random commented at 11:17 am on April 23, 2020:

fixed and also updated some of comment

real-or-random force-pushed on Apr 23, 2020

Make 64x64->64 multiplications constant-time with MSVC on 32bit x86 ad6262705f

real-or-random force-pushed on Jul 26, 2020

real-or-random commented at 11:29 am on July 26, 2020: contributor

rebased

real-or-random cross-referenced this on Jul 26, 2020 from issue Improve constant-timeness on PowerPC by real-or-random

real-or-random commented at 7:20 pm on July 28, 2020: contributor

I discussed this with Thomas Pornin (BearSSL) and he pointed out that

The proof can be made simpler and more “obvious” in two different ways:

With maths:

“and 2^n-1” is equivalent to “mod 2^n”. Thus, given a and b, you want a*b mod 2^n. “b or 2^n” is really b + 2^n, which is equal to b modulo 2^n, and thus the result is correct.

By looking at carries:

Multiplication is really a bunch of left-shifts and additions (that’s the “schoolbook method”). When you modify a bit at rank k, it can impact only bits at rank k or higher, because carry propagation in additions is only to the left. So, setting bit 63 in an operand cannot modify bits 0-62 in the result.

He also mentioned that a different proper way would be to simply provide our own (inlinable) imlementation of 64x64->64 bit multiplication. That’s true. So far I haven’t chosen this approach because 1) it would touch more of the existing code, and 2) we don’t care about MSVC performance. But if we anyway would enable the workaround proposed in this PR conditionally only on MSVC, and with https://github.com/bitcoin/bitcoin/pull/17309/commits/162d0038e772bf6210b9218df35c14fd9719d3f7 in mind (which is totally not aware of), maybe we should just fix it the proper way.

gmaxwell commented at 11:13 pm on July 28, 2020: contributor

way would be to simply provide our own (inlinable) imlementation of 64x64->64 bit multiplication. That’s true. So far I haven’t chosen this approach because 1) it would touch more of the existing code

One way to do that is to make a MUL_32_32_64 macro, and on msvc it gets converted to an inlinable function call and on everything else it just gets converted to the existing code. This would aid portability to 32-bit platforms that don’t have a 64-bit type. Review would be straightforward in part because the object code on non-MSVC platforms would be completely unchanged.

[The related “doesn’t have a 128-bit type” problem has prevented me from using various static analysis tools on the 64-bit code.]

real-or-random commented at 6:05 am on July 29, 2020: contributor

[The related “doesn’t have a 128-bit type” problem has prevented me from using various static analysis tools on the 64-bit code.]

Yeah, and there’s the _umul128 intrinsic for this on MSVC (https://docs.microsoft.com/en-us/cpp/intrinsics/umul128?view=vs-2019). It has a highly intuitive syntax. (Why didn’t they simply provide a 128 bit type…?)

So we could “solve” this one, too.

real-or-random cross-referenced this on Aug 9, 2020 from issue Make scalar/field choice depend on C-detected __int128 availability by sipa

sipa commented at 3:45 am on September 9, 2020: contributor

ACK but needs rebase.

real-or-random commented at 9:56 am on September 9, 2020: contributor

I’m leaning towards abandoning this PR and solving the issue using a separate 32x32->64 routine, as discussed above. @sipa What do you think?

sipa commented at 2:31 am on September 10, 2020: contributor

@real-or-random I don’t follow, that seems like it’s addressing something unrelated. The 32x32->64 multiplications aren’t the problem, because (uint64_t)a*b works just fine in that case for MSVC. The problem is the 64x32->64 multiplications, where it doesn’t seem we have a solution at all in general, except the trick used in this current PR which is only applicable if we only care about the bottom 63 bits of the output.

peterdettman commented at 3:22 am on September 10, 2020: contributor

Just a heads-up that I have a rewrite of the 10x26 field mul in progress that contains only 32x32 multiplications, amongst other changes.

real-or-random commented at 12:45 pm on September 10, 2020: contributor

@sipa Ok, things got a little messed up in this thread.

way would be to simply provide our own (inlinable) imlementation of 64x64->64 bit multiplication. That’s true. So far I haven’t chosen this approach because 1) it would touch more of the existing code

One way to do that is to make a MUL_32_32_64 macro, and on msvc it gets converted to an inlinable function call and on everything else it just gets converted to the existing code.

This is where the confusion starts. When I read this for the first time (back in July), I assumed @gmaxwell meant to write “a MUL_64_64_64 macro”. But now I apparently adopted this purported “typo”…

So there are two different issues.

a) 64x64->64 is not constant-time on MSVC

What I want to do in order to avoid the variable-time issue is to implement a MUL64_64_64 macro or inlinable function for MSVC, This was suggested by Thomas Pornin along with this canonical implementation

 0    static inline uint64_t
 1    mul64x64to64(uint64_t x, uint64_t y)
 2    {
 3        uint32_t xl, xh, yl, yh;
 4
 5        xl = (uint32_t)x;
 6        xh = (uint32_t)(x >> 32);
 7        yl = (uint32_t)y;
 8        yh = (uint32_t)(y >> 32);
 9        return (uint64_t)xl * (uint64_t)yl
10            + ((uint64_t)(xl * yh + xh * yl) << 32);
11    }

The tree multiplications in this routine are all formally 32x32->64 muls. MSVC’s optimizer detects that correctly and output normal mul and imul instruction. Even if MSVC implemented the multiplication using its llmul routine, the routine is constant-time over 32x32 inputs.

This is much more straightforward and also much faster than the solution in this PR.

b) 32x32->64 and 64x64->128 macros

This is just an orthogonal different issue. These macros would make our code work on 32 bit implementations where we don’t have a 64 bit type, and on 64 bit implementations where we don’t have a 128 bit type (such as MSVC). This would also improve performance on MSVC in cases that MSVC is not clever enough to figure out that a ->64 multiplication is in fact 32x32->64.

Just a heads-up that I have a rewrite of the 10x26 field mul in progress that contains only 32x32 multiplications, amongst other changes.

That sounds great and this rewrite would solve a) and at least the 32x32->64 / field part of b). Can you say more?

gmaxwell commented at 12:48 pm on September 10, 2020: contributor

Fast code that uses only 32-bit multiplies would also be a great prototype for a fast N-way SIMD version.

peterdettman commented at 1:08 pm on September 10, 2020: contributor

That sounds great and this rewrite would solve a) and at least the 32x32->64 / field part of b). Can you say more?

Nothing too fancy on this score; the new code just avoids multiplying by the 64-bit accumulators.

Fast code that uses only 32-bit multiplies would also be a great prototype for a fast N-way SIMD version. @gmaxwell My immediate target is an arbitrary-degree Karatsuba (https://eprint.iacr.org/2015/1247) rewrite for 10x26, which already has some potential for vectorisation within a single fe_mul. I hope you’ll be able to benchmark it on the real hardware in due course.

EDIT: Maybe just to be super-clear, I’m simply using 32x32->64 multiplies.

sipa commented at 4:22 pm on September 10, 2020: contributor

@real-or-random Understood now. Concept ACK on both.

peterdettman commented at 4:42 pm on September 13, 2020: contributor

#815 removes the problematic multiplies from the 10x26 field implementation.

real-or-random commented at 8:43 pm on September 13, 2020: contributor

#815 removes the problematic multiplies from the 10x26 field implementation.

Okay, closing here for now. Depending on the the progress in #815, it may still be useful to have a temporary solution but it’s clear that I won’t be this PR.

real-or-random closed this on Sep 13, 2020

real-or-random cross-referenced this on Dec 1, 2022 from issue 64x64->64 muls are not constant-time with MSVC on 32bit x86 by real-or-random

Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 #711

a) 64x64->64 is not constant-time on MSVC

b) 32x32->64 and 64x64->128 macros