--with-asm=x86_64
,
the selfcheck
method check by calling the cpuid
instruction, if the flags BMI2
(bit 8) and ADX
(bit 19) are set, and will exit early, if not.
--with-asm=x86_64
,
the selfcheck
method check by calling the cpuid
instruction, if the flags BMI2
(bit 8) and ADX
(bit 19) are set, and will exit early, if not.
198+ "mov -0x70(%%rsp),%%r12\n"
199+ "mov -0x68(%%rsp),%%r13\n"
200+ "mov -0x60(%%rsp),%%r14\n"
201+ "mov -0x58(%%rsp),%%r15\n"
202+ "mov -0x50(%%rsp),%%rdi\n"
203+ "mov %%rax,%%rdx\n"
You shouldn’t ever have a need to save or restore inputs in the assembly block; the compiler will produce entry/exit code around to make sure the asm block’s view is exactly as you specify.
If you specify a C value as input to the asm block, and you overwrite that memory location or register, it needs to be specified as an input/output variable rather than as an input.
"+D"(r), "+d"(b)
in the output and "S"(a)
in the input specification rather than "D"(r), "S"(a), "d"(b)
in input works well.
Is there a Docker container that I can clone,
You can build the Docker image locally using docker build . -f ci/linux-debian.Dockerfile
(add --no-cache --pull
to force rebuilds).
WIDEMUL:int128
.., Try ./configure --with-test-override-wide-multiply=int128
.
200+ "mov -0x60(%%rsp),%%r14\n"
201+ "mov -0x58(%%rsp),%%r15\n"
202+ "mov -0x50(%%rsp),%%rdi\n"
203+ "mov %%rax,%%rdx\n"
204+
205+ : "=m"(r[0]), "=m"(r[1]), "=m"(r[2]), "=m"(r[3]), "=m"(r[4])
%0
..%4
to refer to them (or %q0
through %q4
if you need to be explicit about them being 64 bit values).
%0
..%4
, but rather have my constant offsets from r
, %rdi
), I think I can omit line 205 entirely.
202+ "mov -0x50(%%rsp),%%rdi\n"
203+ "mov %%rax,%%rdx\n"
204+
205+ : "=m"(r[0]), "=m"(r[1]), "=m"(r[2]), "=m"(r[3]), "=m"(r[4])
206+
207+ : "D"(r), "S"(a), "d"(b), "m"(a[0]), "m"(b[0]), "m"(a[1]), "m"(b[1]),
One general comment: it’s better to let the compiler do register assignment, so it has the freedom to make it match wherever these values are already held in the call sites.
So you’d specify values as "r"
and refer to them in the asm block as %i
(with i being the position in the combined inputs+outputs list). The existing asm code doesn’t do this because it was written by hand and converted to inline asm, but in your situation it may be possible to be a bit more flexible.
%rdx
being special for mulx
. Then I’d need some way of conditionally emitting instructions in the sense of: “If the parameter is in %rdx
then save it to something else before doing the mulx
. Similar with the other registers. If the compiler decides which registers the parameters go into, I don’t know in the hard coded parts, which registers to use for the intermediate values. And lastly, the more we change this existing code, the less confident we are of the formal correctness. (On that note, I’d actually prefer intel syntax, as that is the one actually verified by Fiat. But I remember that I couldn’t get that to work either reliably with gcc / clang. Maybe that changed in recent versions though.)
But what’s special about these builds is that they have
WIDEMUL:int128
.., Try./configure --with-test-override-wide-multiply=int128
.
I actually did that locally, too but could not reproduce the issue. What worked was building the container and run the cirrus script in there with the given environment, as per cirrus logs.
I could then reproduce and fix the issue, which I believe depends a bit on the compiler version.
Essentially, the problem was that the assembly code relies on being a leaf function, i.e. assumes rsp
pointing to the end of the stack, can use the red zone, will spill callee-save registers when needed and unspill them when done. As the code is potentially inlined (as in those tests), those assumptions are incorrect.
I’ve changed the code by hand to not spill into rsp-based memory whatsoever, and instead use three tmp variables like the old code. Additionally, I used the +D
and +d
modifiers for the arguments.
The currently failing tests don’t seem to be related to the field arithmetic. Instead, they seem to fail at the self test when calling ./ctime
.
As far as I can see, it fails gracefully after the secp256k1_selftest_passes
returns 0
, because the required CPU flags are not available. This however, does not seem to be caused by missing cpuflags, but rather because of running CPUID in valgrind.
https://cirrus-ci.com/task/4534411765481472?logs=test#L237
==3831== Process terminating with default action of signal 6 (SIGABRT): dumping core ==3831== at 0x49B2CE1: raise (raise.c:51) ==3831== by 0x499C536: abort (abort.c:79) ==3831== by 0x484B2E3: secp256k1_default_error_callback_fn (util.h:72) ==3831== by 0x4852C95: secp256k1_callback_call (util.h:60) ==3831== by 0x4852C95: secp256k1_selftest (secp256k1.c:87) ==3831== by 0x4852C95: secp256k1_selftest (secp256k1.c:85) ==3831== by 0x4852D9F: secp256k1_context_preallocated_create (secp256k1.c:121) ==3831== by 0x4852E37: secp256k1_context_create (secp256k1.c:143) ==3831== by 0x1091F2: main (ctime_tests.c:45)
SECP256K1_CHECKMEM_RUNNING
for this, which I believe is also used in ./ctime_tests.c
to early abort if it is not running in valgrind.
Looks good to me now actually, but there are a couple points that I should point out:
CC
(using %q0
..%q2
). This, means that the code as it stands there, is not formally verified, because of the two changed just give. (It was not 100% before either, because we converted the verified Intel assembly to att syntax (by assembling though nasm
to objdump
ing it again as att)).--with-asm=auto
or not specifying that at all will now default to --with-asm=no
. If --with-asm=x86-64
is explicitly set, the CryptOpt generated version is used. The self check checks for presence of BMI2 and ADX flags and errors gracefully if not.18+ "mov (%%rax),%%rdx\n"
19+ "mulx 0x18(%%rsi),%%rcx,%%r8\n"
20+ "mov 0x10(%%rsi),%%rdx\n"
21+ "mulx 0x10(%%rax),%%r9,%%rbx\n"
22+ "mov 0x8(%%rax),%%rdx\n"
23+ "mulx 0x10(%%rsi),%%rbp,%%r12\n"
rbp
and r12
should contain b[1] * a[2]
%rbp
was an input to mulx
.
@dderjoel A common strategy to find reviewers is to convince them that the benefits of the PR are worth the review time. The PR should be as simple as possible to review. Squash the commits into a single or more logical commits. Provide some context and guidance on the best way to review the changes. It seems to me that reviewing this is difficult.
However, keep in mind that, as per the plan discussed here, integrating the fiat-crypto C output code is currently a higher priority.
24@@ -25,6 +25,22 @@ static int secp256k1_selftest_sha256(void) {
25 return secp256k1_memcmp_var(out, output32, 32) == 0;
26 }
27
28+static int secp256k1_selftest_cpu_check_passes(void) {
0static int secp256k1_selftest_cpuid(void) {
cpuid
says moresecp256k1_selftest_passes
has a “passes” suffix is mostly that secp256k1_selftest
was already gone for the corresponding API function34+ "shr $8, %%rax\n"
35+ "and %%rbx, %%rax\n"
36+ "and $1, %%rax\n"
37+ : "=a"(ret)
38+ : "a"(7), "c"(0)
39+ : "rdx", "rbx", "cc");
7+
8+#include "../util.h"
9+
10+SECP256K1_INLINE static void
11+secp256k1_fe_mul_inner(uint64_t *r, const uint64_t *a,
12+ const uint64_t *SECP256K1_RESTRICT b) {
SECP256K1_RESTRICT
from the old code, but I think it can be dropped. It’s rather meaningless here. (What is the compiler supposed to optimize here?! It’s just asm.) Or am I overlooking something?
- (What is the compiler supposed to optimize here?! It’s just asm.)
Now that I think about it again, I suggest leaving it in. It’s there now, and whether to remove it is probably a separate discussion.
But in general, can you confirm that the code here is intended to stick to the rules laid out in field.h
?
0 /* (...)
1 * r and a may point to the same object, but neither can be equal to b. (...)
2 */
3static void secp256k1_fe_mul(secp256k1_fe *r, const secp256k1_fe *a, const secp256k1_fe *b);
third_party
dir. :)
Indentation should be 4 spaces in C files. (Sometimes only 2 are used here.)
I agree with @jonasnick, it will help to have clean and well separated commits.
And yes, at least according to the plan we made, this is blocked on fiat-crypto C code. No matter if it’s blocked or not, it’s not realistic to get this merged by 28th. But I feel it’s certainly fair to say that the code is under review and planned to be integrated (if this helps).
gcc -O3 -march=skylake
and clang -march=skylake
See https://github.com/mit-plv/fiat-crypto/issues/1560#issuecomment-1610288223
rbp
. See this comment. I’m working on generating code without using rbp
, tracked by this issue. I’ll try to get this sorted by next week. Then get some code and update the PR.
Hi,
CryptOpt has now two new features, one is to not use rbp
and the other one is to value the potential memory overlap of the first two parameters, as @real-or-random said earlier. Although this property is not yet formally checked (tracked here), manually checking it shows that it is present. It also yields very similar (a bit faster actually) performance results, see below.
I’ve also rebased the branch on the current master, and changed the note in the Readme, that the assembly is no longer the hand optimized version, but the CryptOpt-generated one.
Those numbers are with clang -O3 -mtune=native -march=native
on my 10 different machines. (Making it comparable with the third table in this comment, showing that the fiat_cryptopt
column is a tad better.
implementation | default_asm | default_c | default_c52 | fiat_c | fiat_c_narrow_int | fiat_cryptopt |
---|---|---|---|---|---|---|
ecmult_gen | 15.7082 | 15.2509 | 15.6863 | 14.9906 | 14.548 | 13.9724 |
ecmult_const | 29.4523 | 28.0577 | 29.2846 | 27.2908 | 27.6586 | 26.3218 |
ecmult_1p | 23.3684 | 22.0285 | 23.1136 | 21.5132 | 21.9431 | 21.08 |
ecmult_0p_g | 17.0308 | 15.9766 | 16.5105 | 15.3775 | 16.4462 | 15.7157 |
ecmult_1p_g | 13.6926 | 12.9288 | 13.5101 | 12.6087 | 12.8326 | 12.3297 |
field_half | 0.00518338 | 0.00520039 | 0.00509803 | 0.00509937 | 0.00501453 | 0.005152 |
field_normalize | 0.00479336 | 0.00478621 | 0.00480812 | 0.00474333 | 0.00473504 | 0.00474445 |
field_normalize_weak | 0.00284125 | 0.00284032 | 0.00284684 | 0.0028214 | 0.00281811 | 0.00281636 |
field_sqr | 0.0136875 | 0.014248 | 0.014884 | 0.0135847 | 0.0117539 | 0.0119451 |
field_mul | 0.0170591 | 0.0168615 | 0.0179167 | 0.0167125 | 0.0147756 | 0.0143076 |
field_inverse | 1.52243 | 1.52486 | 1.52639 | 1.5235 | 1.51161 | 1.64494 |
field_inverse_var | 0.927991 | 0.923316 | 0.9266 | 0.916022 | 0.913502 | 0.926668 |
field_is_square_var | 1.21713 | 1.22058 | 1.21377 | 1.20963 | 1.20327 | 1.22883 |
field_sqrt | 3.83393 | 3.11757 | 3.34495 | 3.74102 | 3.31233 | 3.40706 |
static_casts
when preprocessing, and I do not check the bounds. Also, I assume that e.g. for an adx, the last parameter is a single bit.)
--inbounds-multiplier 8
.
./dettman_multiplication --inline --static --use-value-barrier secp256k1_dettman 64 5 48 2 '2^256 - 4294968273' $m --no-primitives --no-wide-int --shiftr-avoid-uint1 --output /dev/null --output-asm /dev/null --hints-file $f --inbounds-multiplier 8