(Previous discussion of this topic in #438.)
explicit_bzero() is used when the platform provides it (glibc 2.25+); otherwise the code falls back to an inline implementation. The inline fallback could be written in several ways, and I have no reason to prefer one over another.
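As a rough illustration of the fallback described above, here is a minimal sketch rather than the exact code in the commit stack. The function name `secure_clear` and the `HAVE_EXPLICIT_BZERO` macro are hypothetical names chosen for this example; the volatile function pointer approach is assumed from the `volatile_memset` symbol visible in the gcc 4.9.2 assembly in this description.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: zero `len` bytes at `ptr` in a way the compiler
 * should not be able to elide as a dead store.
 * HAVE_EXPLICIT_BZERO is an assumed configure-time macro, not a real one
 * from this project. */
static void secure_clear(void *ptr, size_t len) {
#if defined(HAVE_EXPLICIT_BZERO)
    /* glibc 2.25+ provides explicit_bzero() in <string.h>. */
    explicit_bzero(ptr, len);
#else
    /* Fallback: call memset through a volatile function pointer. The
     * compiler must reload the pointer before the call, so it cannot
     * assume the callee is memset and drop the write. */
    void *(*volatile volatile_memset)(void *, int, size_t) = memset;
    volatile_memset(ptr, 0, len);
#endif
}
```

A caller would use it on a secret just before the secret goes out of scope, for example `secure_clear(key, sizeof key);`. Other fallbacks (a byte loop over a `volatile unsigned char *`, or a compiler barrier after a plain memset) would serve the same purpose.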
The commit stack does some refactoring in order to preserve the intended non-'cleanse' functionality done by several of the _clear() functions. I constructed the stack as unsquashed, incremental refactoring changes. Please let me know if a squash is preferred.
The measured delta in performance appears negligible. Trials were run inside two minimal headless VMs: one with Debian 8 for glibc 2.19, the other with the current Arch Linux base for glibc 2.25. The full session output of the trials was logged for reference and is available as a gist.
For a bare-metal reference on the same CPU, here are timings from the underlying native Debian 8 install (glibc 2.19 and gcc 4.9.2) running master:
(wo/ endomorphism)
$ ./bench_sign
ecdsa_sign: min 55.3us / avg 55.4us / max 55.5us
$ ./bench_verify
ecdsa_verify: min 85.6us / avg 85.9us / max 86.3us
ecdsa_verify_openssl: min 476us / avg 484us / max 490us
Trials:
- master w/ glibc 2.19, gcc 4.9.2, w/ and wo/ endomorphism
(wo/ endomorphism)
$ ./bench_sign
ecdsa_sign: min 55.9us / avg 56.4us / max 56.7us
$ ./bench_verify
ecdsa_verify: min 86.2us / avg 86.8us / max 87.7us
ecdsa_verify_openssl: min 487us / avg 488us / max 491us
(w/ endomorphism)
$ ./bench_sign
ecdsa_sign: min 56.1us / avg 56.6us / max 57.2us
$ ./bench_verify
ecdsa_verify: min 66.5us / avg 67.4us / max 68.8us
ecdsa_verify_openssl: min 487us / avg 493us / max 504us
- a16034039f07572a64bc9704c23abb1fff9d70ad w/ glibc 2.19, gcc 4.9.2, w/ and wo/ endomorphism
(wo/ endomorphism)
$ ./bench_sign
ecdsa_sign: min 55.8us / avg 56.5us / max 57.0us
$ ./bench_verify
ecdsa_verify: min 86.3us / avg 86.6us / max 86.9us
ecdsa_verify_openssl: min 482us / avg 483us / max 484us
(w/ endomorphism)
$ ./bench_sign
ecdsa_sign: min 56.0us / avg 56.5us / max 56.7us
$ ./bench_verify
ecdsa_verify: min 65.7us / avg 66.0us / max 66.4us
ecdsa_verify_openssl: min 482us / avg 484us / max 489us
- master w/ glibc 2.25, gcc 6.3.1, w/ and wo/ endomorphism
(wo/ endomorphism)
$ ./bench_sign
ecdsa_sign: min 57.2us / avg 57.8us / max 58.1us
$ ./bench_verify
ecdsa_verify: min 76.1us / avg 76.7us / max 77.7us
ecdsa_verify_openssl: min 461us / avg 464us / max 482us
(w/ endomorphism)
$ ./bench_sign
ecdsa_sign: min 57.0us / avg 57.2us / max 57.5us
$ ./bench_verify
ecdsa_verify: min 54.8us / avg 55.1us / max 55.5us
ecdsa_verify_openssl: min 461us / avg 462us / max 467us
- a16034039f07572a64bc9704c23abb1fff9d70ad w/ glibc 2.25, gcc 6.3.1, w/ and wo/ endomorphism
(wo/ endomorphism)
$ ./bench_sign
ecdsa_sign: min 56.7us / avg 57.0us / max 57.5us
$ ./bench_verify
ecdsa_verify: min 75.6us / avg 76.1us / max 76.6us
ecdsa_verify_openssl: min 460us / avg 460us / max 461us
(w/ endomorphism)
$ ./bench_sign
ecdsa_sign: min 56.5us / avg 56.8us / max 57.1us
$ ./bench_verify
ecdsa_verify: min 54.7us / avg 54.8us / max 55.0us
ecdsa_verify_openssl: min 462us / avg 463us / max 464us
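To put a number on "negligible", the average timings from the trials above can be compared pairwise (master versus a16034039f07572a64bc9704c23abb1fff9d70ad under the same toolchain). The sketch below only does that arithmetic; the function names `percent_delta`, `report`, and `report_all` are made up for this example, and the input values are the `avg` figures copied from the logs above.

```c
#include <stdio.h>

/* Relative change of `patched` versus `baseline`, in percent.
 * Negative means the patched build was faster. */
double percent_delta(double baseline, double patched) {
    return (patched - baseline) / baseline * 100.0;
}

/* Print one comparison row: label, both averages, and the delta. */
void report(const char *label, double baseline, double patched) {
    printf("%-32s %5.1fus -> %5.1fus (%+.2f%%)\n",
           label, baseline, patched, percent_delta(baseline, patched));
}

/* Averages taken from the trial logs: master first, patched second. */
void report_all(void) {
    report("glibc 2.19 sign   wo/ endo", 56.4, 56.5);
    report("glibc 2.19 verify wo/ endo", 86.8, 86.6);
    report("glibc 2.19 sign   w/ endo",  56.6, 56.5);
    report("glibc 2.19 verify w/ endo",  67.4, 66.0);
    report("glibc 2.25 sign   wo/ endo", 57.8, 57.0);
    report("glibc 2.25 verify wo/ endo", 76.7, 76.1);
    report("glibc 2.25 sign   w/ endo",  57.2, 56.8);
    report("glibc 2.25 verify w/ endo",  55.1, 54.8);
}
```

Calling `report_all()` from a `main()` prints one line per comparison. The deltas are small in both directions and comparable to the spread between min and max within a single run, which is consistent with run-to-run noise rather than a real cost from the change.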
I also verified from the generated assembly that the clearing calls are not optimized out. With the inline implementation on glibc 2.19, the end of secp256k1_ecmult_gen() as compiled by gcc 4.9.2 looks like:
movl %r15d, 120(%rbx) # infinity, r_7(D)->infinity
movq 128(%rsp), %rax # %sfp, ivtmp.249
cmpq $64, %rax #, ivtmp.249
je .L83 #,
salq $34, %rax #, D.9134
shrq $37, %rax #, D.9134
movl 304(%rsp,%rax,4), %r9d # gnb.d, D.9141
jmp .L84 #
.L83:
movq $memset, 768(%rsp) #, volatile_memset
leaq 300(%rsp), %rdi #, tmp5061
movq 768(%rsp), %rax # volatile_memset, D.9135
movl $4, %edx #,
xorl %esi, %esi #
call *%rax # D.9135
movq $memset, 816(%rsp) #, volatile_memset
leaq 976(%rsp), %rdi #, tmp5062
movq 816(%rsp), %rax # volatile_memset, D.9135
movl $84, %edx #,
xorl %esi, %esi #
call *%rax # D.9135
movq $memset, 864(%rsp) #, volatile_memset
leaq 304(%rsp), %rdi #, tmp5063
movq 864(%rsp), %rax # volatile_memset, D.9135
movl $32, %edx #,
xorl %esi, %esi #
call *%rax # D.9135
addq $1080, %rsp #,
With explicit_bzero() from glibc 2.25 linked in and gcc 6.3.1 as the compiler, the same end section of secp256k1_ecmult_gen() looks like:
movl $0, 240(%rsp) #, add.infinity
call secp256k1_gej_add_ge #
addq $1, (%rsp) #, %sfp
movq 8(%rsp), %r10 # %sfp, tmp242
movq (%rsp), %rax # %sfp, ivtmp.629
movq 16(%rsp), %r9 # %sfp, _90
movq 24(%rsp), %r8 # %sfp, _85
cmpq $64, %rax #, ivtmp.629
jne .L360 #,
leaq 60(%rsp), %rdi #, tmp380
movl $4, %esi #,
call explicit_bzero #
leaq 160(%rsp), %rdi #, tmp381
movl $88, %esi #,
call explicit_bzero #
leaq 64(%rsp), %rdi #, tmp382
movl $32, %esi #,
call explicit_bzero #
addq $264, %rsp #,