In this PR the auto detection of x86_64 is removed and the configure file will default to the C implementation.
Further, if asm has explicitly been requested with --with-asm=x86_64,
the selfcheck method check by calling the cpuid instruction, if the flags BMI2 (bit 8) and ADX (bit 19) are set, and will exit early, if not.
Replace ASM with CryptOpt generated #1329
pull dderjoel wants to merge 4 commits into bitcoin-core:master from dderjoel:only-asm changing 5 files +363 −495-
dderjoel commented at 5:02 AM on May 25, 2023: none
-
in src/third_party/field_5x52_asm_impl_cryptopt.h:203 in 183ebfafc5 outdated
198 | + "mov -0x70(%%rsp),%%r12\n" 199 | + "mov -0x68(%%rsp),%%r13\n" 200 | + "mov -0x60(%%rsp),%%r14\n" 201 | + "mov -0x58(%%rsp),%%r15\n" 202 | + "mov -0x50(%%rsp),%%rdi\n" 203 | + "mov %%rax,%%rdx\n"
dderjoel commented at 5:05 AM on May 25, 2023:@sipa, re #1261 (comment) and #1261 (comment) those two instructions are sometimes not necessary. They are conservatively restoring the input variables.
sipa commented at 12:23 PM on May 26, 2023:You shouldn't ever have a need to save or restore inputs in the assembly block; the compiler will produce entry/exit code around to make sure the asm block's view is exactly as you specify.
If you specify a C value as input to the asm block, and you overwrite that memory location or register, it needs to be specified as an input/output variable rather than as an input.
dderjoel commented at 2:07 AM on May 29, 2023:I understand. Using
"+D"(r), "+d"(b)in the output and"S"(a)in the input specification rather than"D"(r), "S"(a), "d"(b)in input works well.dderjoel commented at 8:53 AM on May 25, 2023: noneCan someone help with those two errors? I can see that those runs exit with segmentation faults, but I cannot reproduce them on my machine. How do you typically debug those (I have no experience with Cirrus.CI). Is there a Docker container that I can clone, or the binary I can run in a debugger locally?
real-or-random commented at 9:46 AM on May 25, 2023: contributorIs there a Docker container that I can clone,
You can build the Docker image locally using
docker build . -f ci/linux-debian.Dockerfile(add--no-cache --pullto force rebuilds).real-or-random commented at 9:49 AM on May 25, 2023: contributorBut what's special about these builds is that they have
WIDEMUL:int128.., Try./configure --with-test-override-wide-multiply=int128.in src/third_party/field_5x52_asm_impl_cryptopt.h:205 in 5db0678b6a outdated
200 | + "mov -0x60(%%rsp),%%r14\n" 201 | + "mov -0x58(%%rsp),%%r15\n" 202 | + "mov -0x50(%%rsp),%%rdi\n" 203 | + "mov %%rax,%%rdx\n" 204 | + 205 | + : "=m"(r[0]), "=m"(r[1]), "=m"(r[2]), "=m"(r[3]), "=m"(r[4])
sipa commented at 12:21 PM on May 26, 2023:You don't know which memory locations the compiler will assign to these variables. You need to use
%0..%4to refer to them (or%q0through%q4if you need to be explicit about them being 64 bit values).
dderjoel commented at 12:23 AM on May 29, 2023:I just wanted to make the compiler aware, that I am writing to those values in the asm block, such that the compiler is not issuing a dirty read if it interleaves it with something else. I feel this is more of a precaution that I did because I kept getting segfaults. As I am not reading those values anyway, (i.e. use
%0..%4, but rather have my constant offsets fromr,%rdi), I think I can omit line 205 entirely.in src/third_party/field_5x52_asm_impl_cryptopt.h:207 in 5db0678b6a outdated
202 | + "mov -0x50(%%rsp),%%rdi\n" 203 | + "mov %%rax,%%rdx\n" 204 | + 205 | + : "=m"(r[0]), "=m"(r[1]), "=m"(r[2]), "=m"(r[3]), "=m"(r[4]) 206 | + 207 | + : "D"(r), "S"(a), "d"(b), "m"(a[0]), "m"(b[0]), "m"(a[1]), "m"(b[1]),
sipa commented at 12:31 PM on May 26, 2023:One general comment: it's better to let the compiler do register assignment, so it has the freedom to make it match wherever these values are already held in the call sites.
So you'd specify values as
"r"and refer to them in the asm block as%i(with i being the position in the combined inputs+outputs list). The existing asm code doesn't do this because it was written by hand and converted to inline asm, but in your situation it may be possible to be a bit more flexible.
dderjoel commented at 12:34 AM on May 29, 2023:Hm, I see. I think I wanted to do this too, but I wasn't able to because of
%rdxbeing special formulx. Then I'd need some way of conditionally emitting instructions in the sense of: "If the parameter is in%rdxthen save it to something else before doing themulx. Similar with the other registers. If the compiler decides which registers the parameters go into, I don't know in the hard coded parts, which registers to use for the intermediate values. And lastly, the more we change this existing code, the less confident we are of the formal correctness. (On that note, I'd actually prefer intel syntax, as that is the one actually verified by Fiat. But I remember that I couldn't get that to work either reliably with gcc / clang. Maybe that changed in recent versions though.)dderjoel commented at 2:29 AM on May 29, 2023: noneBut what's special about these builds is that they have
WIDEMUL:int128.., Try./configure --with-test-override-wide-multiply=int128.I actually did that locally, too but could not reproduce the issue. What worked was building the container and run the cirrus script in there with the given environment, as per cirrus logs.
I could then reproduce and fix the issue, which I believe depends a bit on the compiler version. Essentially, the problem was that the assembly code relies on being a leaf function, i.e. assumes
rsppointing to the end of the stack, can use the red zone, will spill callee-save registers when needed and unspill them when done. As the code is potentially inlined (as in those tests), those assumptions are incorrect.I've changed the code by hand to not spill into rsp-based memory whatsoever, and instead use three tmp variables like the old code. Additionally, I used the
+Dand+dmodifiers for the arguments.dderjoel commented at 4:15 AM on May 29, 2023: noneThe currently failing tests don't seem to be related to the field arithmetic. Instead, they seem to fail at the self test when calling
./ctime. As far as I can see, it fails gracefully after thesecp256k1_selftest_passesreturns0, because the required CPU flags are not available. This however, does not seem to be caused by missing cpuflags, but rather because of running CPUID in valgrind.https://cirrus-ci.com/task/4534411765481472?logs=test#L237
==3831== Process terminating with default action of signal 6 (SIGABRT): dumping core ==3831== at 0x49B2CE1: raise (raise.c:51) ==3831== by 0x499C536: abort (abort.c:79) ==3831== by 0x484B2E3: secp256k1_default_error_callback_fn (util.h:72) ==3831== by 0x4852C95: secp256k1_callback_call (util.h:60) ==3831== by 0x4852C95: secp256k1_selftest (secp256k1.c:87) ==3831== by 0x4852C95: secp256k1_selftest (secp256k1.c:85) ==3831== by 0x4852D9F: secp256k1_context_preallocated_create (secp256k1.c:121) ==3831== by 0x4852E37: secp256k1_context_create (secp256k1.c:143) ==3831== by 0x1091F2: main (ctime_tests.c:45)
dderjoel commented at 5:30 AM on May 29, 2023: noneNice, so I've changed the selftest such that it will not check the CPU flags, if it is running in valgrind. I've used
SECP256K1_CHECKMEM_RUNNINGfor this, which I believe is also used in./ctime_tests.cto early abort if it is not running in valgrind.dderjoel marked this as ready for review on May 29, 2023dderjoel commented at 8:04 AM on May 29, 2023: noneLooks good to me now actually, but there are a couple points that I should point out:
- CPUID in the self test is not checked when using valgrind.
- The spilling / unspilling of callee-saved registers is removed from the assembly implementations and with
- the spilling of three intermediate values is now handled by
CC(using%q0..%q2). This, means that the code as it stands there, is not formally verified, because of the two changed just give. (It was not 100% before either, because we converted the verified Intel assembly to att syntax (by assembling thoughnasmtoobjdumping it again as att)). - Using
--with-asm=autoor not specifying that at all will now default to--with-asm=no. If--with-asm=x86-64is explicitly set, the CryptOpt generated version is used. The self check checks for presence of BMI2 and ADX flags and errors gracefully if not.
in src/third_party/field_5x52_asm_impl_cryptopt.h:23 in e28dd68043 outdated
18 | + "mov (%%rax),%%rdx\n" 19 | + "mulx 0x18(%%rsi),%%rcx,%%r8\n" 20 | + "mov 0x10(%%rsi),%%rdx\n" 21 | + "mulx 0x10(%%rax),%%r9,%%rbx\n" 22 | + "mov 0x8(%%rax),%%rdx\n" 23 | + "mulx 0x10(%%rsi),%%rbp,%%r12\n"
sipa commented at 4:54 PM on May 31, 2023:What is in RBP at this point?
dderjoel commented at 6:22 AM on June 2, 2023:I don't know, will be overwritten, right? I've put rbp into the clobbers.
dderjoel commented at 6:24 AM on June 2, 2023:rbpandr12should containb[1] * a[2]
sipa commented at 3:38 PM on June 12, 2023:Oh, I see. I misunderstood the syntax, and thought
%rbpwas an input tomulx.dderjoel commented at 10:38 PM on June 7, 2023: noneIs there anything I can help with for this PR? What would the next steps be?
real-or-random added the label assurance on Jun 11, 2023real-or-random added the label performance on Jun 11, 2023dderjoel commented at 3:32 PM on June 17, 2023: noneHey, we’d love to mention this at an upcoming conference (acknowledging your support; top event in heuristic optimization), but for that it would have to be merged by 28 June. Of course, we understand that you need to be convinced of the usefulness of the verified and optimized... and of course we understand that you have your own workflows and deadlines, too. Still, if there is anything we can do to push this further, please do let us know.
jonasnick commented at 2:51 PM on June 20, 2023: contributor@dderjoel A common strategy to find reviewers is to convince them that the benefits of the PR are worth the review time. The PR should be as simple as possible to review. Squash the commits into a single or more logical commits. Provide some context and guidance on the best way to review the changes. It seems to me that reviewing this is difficult.
However, keep in mind that, as per the plan discussed here, integrating the fiat-crypto C output code is currently a higher priority.
in src/selftest.h:28 in 57980c90b4 outdated
24 | @@ -25,6 +25,22 @@ static int secp256k1_selftest_sha256(void) { 25 | return secp256k1_memcmp_var(out, output32, 32) == 0; 26 | } 27 | 28 | +static int secp256k1_selftest_cpu_check_passes(void) {
real-or-random commented at 9:57 PM on June 20, 2023:static int secp256k1_selftest_cpuid(void) {cpuidsays more- the reason
secp256k1_selftest_passeshas a "passes" suffix is mostly thatsecp256k1_selftestwas already gone for the corresponding API function
in src/selftest.h:39 in 57980c90b4 outdated
34 | + "shr $8, %%rax\n" 35 | + "and %%rbx, %%rax\n" 36 | + "and $1, %%rax\n" 37 | + : "=a"(ret) 38 | + : "a"(7), "c"(0) 39 | + : "rdx", "rbx", "cc");
real-or-random commented at 9:58 PM on June 20, 2023:- I think it would be better to reduce the asm part to just getting the CPUID value, and use normal C code to do the bit fiddling.
- The bit fiddling part should explain what it checks for, i.e., ADX and BMI2.
in src/third_party/field_5x52_asm_impl_cryptopt.h:12 in 57980c90b4 outdated
7 | + 8 | +#include "../util.h" 9 | + 10 | +SECP256K1_INLINE static void 11 | +secp256k1_fe_mul_inner(uint64_t *r, const uint64_t *a, 12 | + const uint64_t *SECP256K1_RESTRICT b) {
real-or-random commented at 10:04 PM on June 20, 2023:- This can go on one line. (We don't have a fixed value for max length of lines, but they tend to be longer than shorter.)
- I know you took the
SECP256K1_RESTRICTfrom the old code, but I think it can be dropped. It's rather meaningless here. (What is the compiler supposed to optimize here?! It's just asm.) Or am I overlooking something?
real-or-random commented at 2:34 PM on June 29, 2023:- (What is the compiler supposed to optimize here?! It's just asm.)
Now that I think about it again, I suggest leaving it in. It's there now, and whether to remove it is probably a separate discussion.
But in general, can you confirm that the code here is intended to stick to the rules laid out in
field.h?/* (...) * r and a may point to the same object, but neither can be equal to b. (...) */ static void secp256k1_fe_mul(secp256k1_fe *r, const secp256k1_fe *a, const secp256k1_fe *b);
dderjoel commented at 4:44 AM on July 1, 2023:Neither the Optimizer, nor the checker enforces that property. But manually checking it at the moment shows that it is the case. Maybe it's possible to check that on the fiat side. I'm tracking that in issue https://github.com/0xADE1A1DE/CryptOpt/issues/167 for the cryptopt side.
in src/third_party/field_5x52_asm_impl_cryptopt.h:1 in 57980c90b4
real-or-random commented at 10:07 PM on June 20, 2023:I think we could keep the old file name.
real-or-random commented at 1:18 PM on June 29, 2023:Sorry, I meant the entire name (including the path). No need to have a
third_partydir. :)real-or-random commented at 10:11 PM on June 20, 2023: contributorIndentation should be 4 spaces in C files. (Sometimes only 2 are used here.)
I agree with @jonasnick, it will help to have clean and well separated commits.
And yes, at least according to the plan we made, this is blocked on fiat-crypto C code. No matter if it's blocked or not, it's not realistic to get this merged by 28th. But I feel it's certainly fair to say that the code is under review and planned to be integrated (if this helps).
dderjoel force-pushed on Jun 21, 2023dderjoel force-pushed on Jun 21, 2023dderjoel commented at 2:54 PM on June 21, 2023: noneThanks for the feedback. I kept the commits in there to have the history and thought we may squash them on merge. I've rebased everything now from the current master, and renamed the commits, while incorporating the ideas from you all.
dderjoel force-pushed on Jun 21, 2023dderjoel commented at 10:27 PM on June 27, 2023: noneDo not merge until tested with
gcc -O3 -march=skylakeandclang -march=skylakeSee https://github.com/mit-plv/fiat-crypto/issues/1560#issuecomment-1610288223dderjoel marked this as a draft on Jun 27, 2023dderjoel commented at 8:46 AM on June 28, 2023: noneProblem seems indeed to be
rbp. See this comment. I'm working on generating code without usingrbp, tracked by this issue. I'll try to get this sorted by next week. Then get some code and update the PR.dderjoel force-pushed on Jul 1, 2023dderjoel force-pushed on Jul 1, 2023dderjoel marked this as ready for review on Jul 1, 2023khadem5677 approvedkhadem5677 approvedkhadem5677 approvedbuild: not default to x64_asm. only use the asm if explicitly requested e0494aafd9added runtime check for bmi2 and adx d38ab07c01replace the asm implementation for mul/square inner 4742310ff4dderjoel force-pushed on Jul 7, 2023dderjoel commented at 12:45 AM on July 7, 2023: noneHi, CryptOpt has now two new features, one is to not use
rbpand the other one is to value the potential memory overlap of the first two parameters, as @real-or-random said earlier. Although this property is not yet formally checked (tracked here), manually checking it shows that it is present. It also yields very similar (a bit faster actually) performance results, see below.I've also rebased the branch on the current master, and changed the note in the Readme, that the assembly is no longer the hand optimized version, but the CryptOpt-generated one.
Those numbers are with
clang -O3 -mtune=native -march=nativeon my 10 different machines. (Making it comparable with the third table in this comment, showing that thefiat_cryptoptcolumn is a tad better.implementation default_asm default_c default_c52 fiat_c fiat_c_narrow_int fiat_cryptopt ecmult_gen 15.7082 15.2509 15.6863 14.9906 14.548 13.9724 ecmult_const 29.4523 28.0577 29.2846 27.2908 27.6586 26.3218 ecmult_1p 23.3684 22.0285 23.1136 21.5132 21.9431 21.08 ecmult_0p_g 17.0308 15.9766 16.5105 15.3775 16.4462 15.7157 ecmult_1p_g 13.6926 12.9288 13.5101 12.6087 12.8326 12.3297 field_half 0.00518338 0.00520039 0.00509803 0.00509937 0.00501453 0.005152 field_normalize 0.00479336 0.00478621 0.00480812 0.00474333 0.00473504 0.00474445 field_normalize_weak 0.00284125 0.00284032 0.00284684 0.0028214 0.00281811 0.00281636 field_sqr 0.0136875 0.014248 0.014884 0.0135847 0.0117539 0.0119451 field_mul 0.0170591 0.0168615 0.0179167 0.0167125 0.0147756 0.0143076 field_inverse 1.52243 1.52486 1.52639 1.5235 1.51161 1.64494 field_inverse_var 0.927991 0.923316 0.9266 0.916022 0.913502 0.926668 field_is_square_var 1.21713 1.22058 1.21377 1.20963 1.20327 1.22883 field_sqrt 3.83393 3.11757 3.34495 3.74102 3.31233 3.40706 dderjoel commented at 12:45 AM on August 8, 2023: none@andres-erbsen , you're saying the C code didn't change. Is there anything that we'd believe would change in the JSON? I wonder if I would need to give it a try to generate myself and check. (Reason why I think there could be changes is that I am removing all
static_castswhen preprocessing, and I do not check the bounds. Also, I assume that e.g. for an adx, the last parameter is a single bit.)andres-erbsen commented at 2:51 AM on August 8, 2023: noneI would guess there are no relevant changes. If so, I expect the fiat-crypto equivalence checker would pass with the assembly files you have and
--inbounds-multiplier 8.dderjoel commented at 5:23 AM on August 8, 2023: noneNice, I've double checked: I've gotten the latest binaries from the Artifact of https://github.com/mit-plv/fiat-crypto/actions/runs/5721610921 Then I've checked all the current files including the ones for this PR and they all pass. with
./dettman_multiplication --inline --static --use-value-barrier secp256k1_dettman 64 5 48 2 '2^256 - 4294968273' $m --no-primitives --no-wide-int --shiftr-avoid-uint1 --output /dev/null --output-asm /dev/null --hints-file $f --inbounds-multiplier 8Merge branch 'bitcoin-core:master' into only-asm 3eaab693d2
github-metadata-mirror
This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-14 22:15 UTC
This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me