(Previous discussion of this topic in #438.)
explicit_bzero() is used when the platform provides it (glibc 2.25+); otherwise the code falls back to an inline implementation. There are many possible ways to implement the inline version, and I have no reason to prefer one over another.
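As a minimal sketch of that selection logic, not the exact code in this stack: the names `DEMO_HAVE_EXPLICIT_BZERO` and `demo_cleanse` are hypothetical, and the feature test assumes glibc's `__GLIBC_PREREQ` macro. The fallback shown is the volatile function pointer variant, which is only one of the possible inline versions.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* ask glibc to declare explicit_bzero() */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical feature test: glibc 2.25 and later ship explicit_bzero().
 * The nested #if keeps the preprocessor from evaluating __GLIBC_PREREQ
 * on platforms where that macro does not exist. */
#if defined(__GLIBC__) && defined(__GLIBC_PREREQ)
# if __GLIBC_PREREQ(2, 25)
#  define DEMO_HAVE_EXPLICIT_BZERO 1
# endif
#endif

/* Hypothetical wrapper name, for illustration only. */
static void demo_cleanse(void *ptr, size_t len) {
    assert(ptr != NULL || len == 0);
#ifdef DEMO_HAVE_EXPLICIT_BZERO
    explicit_bzero(ptr, len);
#else
    /* Fallback: call memset through a volatile function pointer. The
     * pointer has to be reloaded before the call, so in practice the
     * compiler does not treat the call as a removable dead store. */
    void *(*volatile volatile_memset)(void *, int, size_t) = memset;
    volatile_memset(ptr, 0, len);
#endif
}
```

Other fallbacks, such as a byte loop through a `volatile unsigned char *` or a plain memset() followed by a compiler barrier, would fit behind the same wrapper.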
The commit stack does some refactoring to preserve the intended non-‘cleanse’ functionality of several of the _clear() functions. I constructed the stack as unsquashed, incremental refactoring changes. Please let me know if a squash is preferred.
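To illustrate the kind of split this refers to, here is a rough sketch with hypothetical names (`demo_scalar`, `demo_scalar_set_zero`, `demo_scalar_clear`); it is not taken from the commits. The idea is that a caller who needs a well-defined value afterwards uses an ordinary setter, while a caller who is discarding a secret uses the cleansing path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type, for illustration only. */
typedef struct {
    unsigned int d[8];
} demo_scalar;

/* Wipe memory in a way that is meant to survive optimization. Only the
 * fallback variant is shown; explicit_bzero() could be used instead. */
static void demo_cleanse(void *ptr, size_t len) {
    void *(*volatile volatile_memset)(void *, int, size_t) = memset;
    assert(ptr != NULL);
    volatile_memset(ptr, 0, len);
}

/* Functional reset: the caller goes on to use the zero value, so an
 * ordinary store that the optimizer is free to reason about is enough. */
static void demo_scalar_set_zero(demo_scalar *s) {
    assert(s != NULL);
    memset(s->d, 0, sizeof(s->d));
}

/* Cleansing: the object held secret data and is about to be discarded,
 * so the wipe itself is the point and must not be optimized out. */
static void demo_scalar_clear(demo_scalar *s) {
    demo_cleanse(s, sizeof(*s));
}

/* Helper used to observe the state after either operation. */
static int demo_scalar_is_zero(const demo_scalar *s) {
    unsigned int acc = 0;
    size_t i;
    assert(s != NULL);
    for (i = 0; i < sizeof(s->d) / sizeof(s->d[0]); i++) {
        acc |= s->d[i];
    }
    return acc == 0;
}
```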
The measured performance delta appears negligible. Trials were run inside two minimal headless VMs: one with Debian 8 for glibc 2.19, and one with the current Arch Linux base for glibc 2.25. The full session output of the trials was logged for reference and is available as a gist.
For reference, bare-metal timing on the same CPU, taken on the underlying native Debian 8 install (glibc 2.19, gcc 4.9.2) running master:

    (wo/ endomorphism)
    $ ./bench_sign
    ecdsa_sign: min 55.3us / avg 55.4us / max 55.5us
    $ ./bench_verify
    ecdsa_verify: min 85.6us / avg 85.9us / max 86.3us
    ecdsa_verify_openssl: min 476us / avg 484us / max 490us
Trials:
- master w/ glibc 2.19, gcc 4.9.2, w/ and wo/ endomorphism

      (wo/ endomorphism)
      $ ./bench_sign
      ecdsa_sign: min 55.9us / avg 56.4us / max 56.7us
      $ ./bench_verify
      ecdsa_verify: min 86.2us / avg 86.8us / max 87.7us
      ecdsa_verify_openssl: min 487us / avg 488us / max 491us

      (w/ endomorphism)
      $ ./bench_sign
      ecdsa_sign: min 56.1us / avg 56.6us / max 57.2us
      $ ./bench_verify
      ecdsa_verify: min 66.5us / avg 67.4us / max 68.8us
      ecdsa_verify_openssl: min 487us / avg 493us / max 504us
- a16034039f07572a64bc9704c23abb1fff9d70ad w/ glibc 2.19, gcc 4.9.2, w/ and wo/ endomorphism

      (wo/ endomorphism)
      $ ./bench_sign
      ecdsa_sign: min 55.8us / avg 56.5us / max 57.0us
      $ ./bench_verify
      ecdsa_verify: min 86.3us / avg 86.6us / max 86.9us
      ecdsa_verify_openssl: min 482us / avg 483us / max 484us

      (w/ endomorphism)
      $ ./bench_sign
      ecdsa_sign: min 56.0us / avg 56.5us / max 56.7us
      $ ./bench_verify
      ecdsa_verify: min 65.7us / avg 66.0us / max 66.4us
      ecdsa_verify_openssl: min 482us / avg 484us / max 489us
- master w/ glibc 2.25, gcc 6.3.1, w/ and wo/ endomorphism

      (wo/ endomorphism)
      $ ./bench_sign
      ecdsa_sign: min 57.2us / avg 57.8us / max 58.1us
      $ ./bench_verify
      ecdsa_verify: min 76.1us / avg 76.7us / max 77.7us
      ecdsa_verify_openssl: min 461us / avg 464us / max 482us

      (w/ endomorphism)
      $ ./bench_sign
      ecdsa_sign: min 57.0us / avg 57.2us / max 57.5us
      $ ./bench_verify
      ecdsa_verify: min 54.8us / avg 55.1us / max 55.5us
      ecdsa_verify_openssl: min 461us / avg 462us / max 467us
- a16034039f07572a64bc9704c23abb1fff9d70ad w/ glibc 2.25, gcc 6.3.1, w/ and wo/ endomorphism

      (wo/ endomorphism)
      $ ./bench_sign
      ecdsa_sign: min 56.7us / avg 57.0us / max 57.5us
      $ ./bench_verify
      ecdsa_verify: min 75.6us / avg 76.1us / max 76.6us
      ecdsa_verify_openssl: min 460us / avg 460us / max 461us

      (w/ endomorphism)
      $ ./bench_sign
      ecdsa_sign: min 56.5us / avg 56.8us / max 57.1us
      $ ./bench_verify
      ecdsa_verify: min 54.7us / avg 54.8us / max 55.0us
      ecdsa_verify_openssl: min 462us / avg 463us / max 464us
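To put a rough number on the delta, the ecdsa_sign and ecdsa_verify averages from the four trials can be compared pairwise. The following is only a sketch of that arithmetic; the type and function names (`demo_trial`, `percent_delta`, `max_abs_percent_delta`) are hypothetical, and the figures are copied from the trial output.

```c
#include <assert.h>
#include <stddef.h>

/* One benchmark average, in microseconds, on master and on the patched commit. */
typedef struct {
    const char *label;
    double master_avg_us;
    double patched_avg_us;
} demo_trial;

/* Averages copied from the trial output. */
static const demo_trial demo_trials[] = {
    {"glibc 2.19 ecdsa_sign   wo/ endomorphism", 56.4, 56.5},
    {"glibc 2.19 ecdsa_verify wo/ endomorphism", 86.8, 86.6},
    {"glibc 2.19 ecdsa_sign   w/ endomorphism",  56.6, 56.5},
    {"glibc 2.19 ecdsa_verify w/ endomorphism",  67.4, 66.0},
    {"glibc 2.25 ecdsa_sign   wo/ endomorphism", 57.8, 57.0},
    {"glibc 2.25 ecdsa_verify wo/ endomorphism", 76.7, 76.1},
    {"glibc 2.25 ecdsa_sign   w/ endomorphism",  57.2, 56.8},
    {"glibc 2.25 ecdsa_verify w/ endomorphism",  55.1, 54.8},
};

/* Relative change from `before` to `after`, in percent. */
static double percent_delta(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Largest change in either direction across all trials, in percent. */
static double max_abs_percent_delta(const demo_trial *trials, size_t n) {
    double worst = 0.0;
    size_t i;
    for (i = 0; i < n; i++) {
        double d = percent_delta(trials[i].master_avg_us, trials[i].patched_avg_us);
        if (d < 0.0) {
            d = -d;
        }
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}
```

With these inputs the largest change in either direction is about 2.1 percent (ecdsa_verify with endomorphism on glibc 2.19, 67.4us to 66.0us), and that change is a speedup.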
I also verified in the generated assembly that the clearing calls are not optimized out. With the inline implementation on glibc 2.19, the end of secp256k1_ecmult_gen() as compiled by gcc 4.9.2 looks like:

        movl %r15d, 120(%rbx) # infinity, r_7(D)->infinity
        movq 128(%rsp), %rax # %sfp, ivtmp.249
        cmpq $64, %rax #, ivtmp.249
        je .L83 #,
        salq $34, %rax #, D.9134
        shrq $37, %rax #, D.9134
        movl 304(%rsp,%rax,4), %r9d # gnb.d, D.9141
        jmp .L84 #
    .L83:
        movq $memset, 768(%rsp) #, volatile_memset
        leaq 300(%rsp), %rdi #, tmp5061
        movq 768(%rsp), %rax # volatile_memset, D.9135
        movl $4, %edx #,
        xorl %esi, %esi #
        call *%rax # D.9135
        movq $memset, 816(%rsp) #, volatile_memset
        leaq 976(%rsp), %rdi #, tmp5062
        movq 816(%rsp), %rax # volatile_memset, D.9135
        movl $84, %edx #,
        xorl %esi, %esi #
        call *%rax # D.9135
        movq $memset, 864(%rsp) #, volatile_memset
        leaq 304(%rsp), %rdi #, tmp5063
        movq 864(%rsp), %rax # volatile_memset, D.9135
        movl $32, %edx #,
        xorl %esi, %esi #
        call *%rax # D.9135
        addq $1080, %rsp #,
With explicit_bzero() from glibc 2.25 linked in and gcc 6.3.1, the same end section of secp256k1_ecmult_gen() looks like:

        movl $0, 240(%rsp) #, add.infinity
        call secp256k1_gej_add_ge #
        addq $1, (%rsp) #, %sfp
        movq 8(%rsp), %r10 # %sfp, tmp242
        movq (%rsp), %rax # %sfp, ivtmp.629
        movq 16(%rsp), %r9 # %sfp, _90
        movq 24(%rsp), %r8 # %sfp, _85
        cmpq $64, %rax #, ivtmp.629
        jne .L360 #,
        leaq 60(%rsp), %rdi #, tmp380
        movl $4, %esi #,
        call explicit_bzero #
        leaq 160(%rsp), %rdi #, tmp381
        movl $88, %esi #,
        call explicit_bzero #
        leaq 64(%rsp), %rdi #, tmp382
        movl $32, %esi #,
        call explicit_bzero #
        addq $264, %rsp #,