Is the compiler optimizing out some of the benchmarks? #667

pull elichai wants to merge 2 commits into bitcoin-core:master from elichai:2019-09-bench changing 2 files +78 −24
  1. elichai commented at 5:41 pm on September 27, 2019: contributor

    I’ve been playing with the benchmarks and I started thinking about the optimizer. Now most of the benchmarks reuse the variables from past iterations(thanks @apoelstra for showing me that:) ) but the compiler might still be able to figure out it can remove at least the last iteration.

    another thing is the jacobi symbol calculation, that has 0 side effect, so I would assume the optimizer will just remove it (which it seems to do so). I’m not sure what to make out of these results and I think I’ll need to try and read the ASM itself to figure out what’s going on.

    But before:

     0$ ./bench_internal 
     1scalar_add: min 0.00771us / avg 0.00811us / max 0.00934us
     2scalar_negate: min 0.00274us / avg 0.00277us / max 0.00280us
     3scalar_sqr: min 0.0296us / avg 0.0301us / max 0.0308us
     4scalar_mul: min 0.0305us / avg 0.0311us / max 0.0323us
     5scalar_inverse: min 8.65us / avg 8.76us / max 9.03us
     6scalar_inverse_var: min 2.08us / avg 2.14us / max 2.21us
     7field_normalize: min 0.00718us / avg 0.00744us / max 0.00797us
     8field_normalize_weak: min 0.00294us / avg 0.00316us / max 0.00360us
     9field_sqr: min 0.0140us / avg 0.0147us / max 0.0153us
    10field_mul: min 0.0185us / avg 0.0189us / max 0.0195us
    11field_inverse: min 4.09us / avg 4.12us / max 4.17us
    12field_inverse_var: min 2.10us / avg 2.11us / max 2.15us
    13field_sqrt: min 4.07us / avg 4.09us / max 4.14us
    14group_double_var: min 0.123us / avg 0.124us / max 0.126us
    15group_add_var: min 0.283us / avg 0.284us / max 0.287us
    16group_add_affine: min 0.242us / avg 0.244us / max 0.249us
    17group_add_affine_var: min 0.203us / avg 0.204us / max 0.208us
    18group_jacobi_var: min 0.179us / avg 0.186us / max 0.197us
    19wnaf_const: min 0.118us / avg 0.120us / max 0.124us
    20ecmult_wnaf: min 0.447us / avg 0.454us / max 0.465us
    21hash_sha256: min 0.252us / avg 0.257us / max 0.265us
    22hash_hmac_sha256: min 1.01us / avg 1.02us / max 1.06us
    23hash_rfc6979_hmac_sha256: min 5.56us / avg 5.72us / max 6.10us
    24context_verify: min 2652us / avg 2732us / max 2909us
    25context_sign: min 28.1us / avg 31.1us / max 33.2us
    26num_jacobi: min 0.0881us / avg 0.0893us / max 0.0913us
    27
    28$ ./bench_internal 
    29scalar_add: min 0.00793us / avg 0.00834us / max 0.00901us
    30scalar_negate: min 0.00285us / avg 0.00290us / max 0.00294us
    31scalar_sqr: min 0.0302us / avg 0.0310us / max 0.0321us
    32scalar_mul: min 0.0312us / avg 0.0318us / max 0.0323us
    33scalar_inverse: min 8.99us / avg 9.21us / max 9.47us
    34scalar_inverse_var: min 2.16us / avg 2.26us / max 2.39us
    35field_normalize: min 0.00753us / avg 0.00767us / max 0.00792us
    36field_normalize_weak: min 0.00305us / avg 0.00313us / max 0.00319us
    37field_sqr: min 0.0147us / avg 0.0152us / max 0.0158us
    38field_mul: min 0.0194us / avg 0.0196us / max 0.0200us
    39field_inverse: min 4.20us / avg 4.25us / max 4.34us
    40field_inverse_var: min 2.15us / avg 2.19us / max 2.26us
    41field_sqrt: min 4.16us / avg 4.19us / max 4.26us
    42group_double_var: min 0.126us / avg 0.128us / max 0.130us
    43group_add_var: min 0.291us / avg 0.293us / max 0.300us
    44group_add_affine: min 0.250us / avg 0.251us / max 0.252us
    45group_add_affine_var: min 0.208us / avg 0.213us / max 0.223us
    46group_jacobi_var: min 0.182us / avg 0.196us / max 0.204us
    47wnaf_const: min 0.124us / avg 0.126us / max 0.129us
    48ecmult_wnaf: min 0.456us / avg 0.464us / max 0.476us
    49hash_sha256: min 0.261us / avg 0.266us / max 0.274us
    50hash_hmac_sha256: min 1.04us / avg 1.06us / max 1.08us
    51hash_rfc6979_hmac_sha256: min 5.72us / avg 5.76us / max 5.87us
    52context_verify: min 2720us / avg 2789us / max 2922us
    53context_sign: min 28.4us / avg 29.6us / max 31.9us
    54num_jacobi: min 0.0867us / avg 0.0882us / max 0.0906us
    

    After:

     0$ ./bench_internal 
     1scalar_add: min 0.00950us / avg 0.00985us / max 0.0104us
     2scalar_negate: min 0.00338us / avg 0.00343us / max 0.00365us
     3scalar_sqr: min 0.0303us / avg 0.0311us / max 0.0325us
     4scalar_mul: min 0.0319us / avg 0.0325us / max 0.0335us
     5scalar_inverse: min 9.05us / avg 9.15us / max 9.39us
     6scalar_inverse_var: min 2.13us / avg 2.20us / max 2.35us
     7field_normalize: min 0.00853us / avg 0.00864us / max 0.00879us
     8field_normalize_weak: min 0.00394us / avg 0.00401us / max 0.00405us
     9field_sqr: min 0.0151us / avg 0.0154us / max 0.0161us
    10field_mul: min 0.0192us / avg 0.0198us / max 0.0204us
    11field_inverse: min 4.20us / avg 4.24us / max 4.34us
    12field_inverse_var: min 2.17us / avg 2.19us / max 2.22us
    13field_sqrt: min 4.18us / avg 4.25us / max 4.36us
    14group_double_var: min 0.128us / avg 0.129us / max 0.131us
    15group_add_var: min 0.294us / avg 0.298us / max 0.303us
    16group_add_affine: min 0.254us / avg 0.259us / max 0.268us
    17group_add_affine_var: min 0.212us / avg 0.215us / max 0.220us
    18group_jacobi_var: min 0.187us / avg 0.196us / max 0.205us
    19wnaf_const: min 0.124us / avg 0.126us / max 0.129us
    20ecmult_wnaf: min 0.469us / avg 0.475us / max 0.482us
    21hash_sha256: min 0.262us / avg 0.267us / max 0.272us
    22hash_hmac_sha256: min 1.05us / avg 1.06us / max 1.08us
    23hash_rfc6979_hmac_sha256: min 5.84us / avg 5.88us / max 5.97us
    24context_verify: min 2832us / avg 2899us / max 2999us
    25context_sign: min 29.9us / avg 31.5us / max 32.9us
    26num_jacobi: min 0.670us / avg 0.684us / max 0.699us
    27
    28$ ./bench_internal 
    29scalar_add: min 0.00910us / avg 0.00965us / max 0.0108us
    30scalar_negate: min 0.00326us / avg 0.00331us / max 0.00337us
    31scalar_sqr: min 0.0297us / avg 0.0304us / max 0.0316us
    32scalar_mul: min 0.0309us / avg 0.0316us / max 0.0320us
    33scalar_inverse: min 8.87us / avg 8.97us / max 9.08us
    34scalar_inverse_var: min 2.06us / avg 2.17us / max 2.27us
    35field_normalize: min 0.00828us / avg 0.00843us / max 0.00863us
    36field_normalize_weak: min 0.00383us / avg 0.00388us / max 0.00393us
    37field_sqr: min 0.0144us / avg 0.0148us / max 0.0153us
    38field_mul: min 0.0185us / avg 0.0189us / max 0.0197us
    39field_inverse: min 4.15us / avg 4.17us / max 4.26us
    40field_inverse_var: min 2.09us / avg 2.11us / max 2.15us
    41field_sqrt: min 4.09us / avg 4.11us / max 4.14us
    42group_double_var: min 0.125us / avg 0.126us / max 0.128us
    43group_add_var: min 0.284us / avg 0.286us / max 0.287us
    44group_add_affine: min 0.245us / avg 0.246us / max 0.249us
    45group_add_affine_var: min 0.205us / avg 0.207us / max 0.210us
    46group_jacobi_var: min 0.185us / avg 0.192us / max 0.198us
    47wnaf_const: min 0.119us / avg 0.121us / max 0.124us
    48ecmult_wnaf: min 0.456us / avg 0.462us / max 0.478us
    49hash_sha256: min 0.248us / avg 0.257us / max 0.261us
    50hash_hmac_sha256: min 1.02us / avg 1.05us / max 1.09us
    51hash_rfc6979_hmac_sha256: min 5.58us / avg 5.63us / max 5.72us
    52context_verify: min 2652us / avg 2668us / max 2730us
    53context_sign: min 27.7us / avg 29.0us / max 31.4us
    54num_jacobi: min 0.650us / avg 0.656us / max 0.663us
    

    It does seem like there’s a big difference in num_jacobi (from 0.0882us before to 0.656us)

  2. elichai force-pushed on Sep 27, 2019
  3. jonasnick commented at 12:49 pm on September 28, 2019: contributor
    Interesting. field_normalize_weak seems to be significantly affected as well.
  4. elichai commented at 7:56 pm on September 30, 2019: contributor

    So I ran perf only on the jacobi benchmarks. with and without the ASM escaping and these are the results: with escaping:

     0+   57.16%    57.08%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_hgcd2_jacobi                                                                                                                                    
     1+   20.99%     0.00%  bench_internal  [unknown]         [.] 0x495641000143c33d                                                                                                                                     
     2+   20.99%     0.00%  bench_internal  libc-2.29.so      [.] __libc_start_main                                                                                                                                      
     3+   20.99%     0.00%  bench_internal  bench_internal    [.] main                                                                                                                                                   
     4+   20.99%     0.00%  bench_internal  bench_internal    [.] run_benchmark                                                                                                                                          
     5+   17.73%     0.57%  bench_internal  bench_internal    [.] bench_num_jacobi                                                                                                                                       
     6+   16.67%     1.29%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_jacobi                                                                                                                                          
     7+   10.45%    10.41%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_jacobi_base                                                                                                                                     
     8+    3.49%     3.24%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_jacobi_n                                                                                                                                        
     9+    3.38%     3.30%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_jacobi_2                                                                                                                                        
    10+    3.02%     2.76%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_submul_1_coreisbr                                                                                                                               
    11+    2.63%     2.41%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_matrix22_mul1_inverse_vector                                                                                                                    
    12+    2.47%     2.33%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_import                                                                                                                                          
    13+    2.33%     2.33%  bench_internal  libc-2.29.so      [.] _int_free                                                                                                                                              
    14+    2.31%     2.07%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_mul_1_coreisbr                                                                                                                                  
    15+    2.11%     2.09%  bench_internal  libc-2.29.so      [.] _int_malloc                                                                                                                                            
    16+    2.10%     2.00%  bench_internal  libc-2.29.so      [.] realloc                                                                                                                                                
    17+    1.78%     0.00%  bench_internal  [unknown]         [.] 0x0000000000000031                                                                                                                                     
    18+    1.70%     0.00%  bench_internal  [unknown]         [.] 0x34d15553d182f5f5                                                                                                                                     
    19+    1.48%     1.36%  bench_internal  libc-2.29.so      [.] _int_realloc                                                                                                                                           
    20+    1.42%     0.00%  bench_internal  [unknown]         [.] 0xdbdde3e7e9eff3f9                                                                                                                                     
    21+    1.42%     0.00%  bench_internal  [unknown]         [.] 0x301c1e78e20023fb                                                                                                                                     
    22+    1.18%     0.00%  bench_internal  [unknown]         [.] 0x5b0274c085480008                                                                                                                                     
    23+    1.18%     0.00%  bench_internal  [unknown]         [.] 0x0406020604020402                                                                                                                                     
    24+    1.15%     1.05%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_copyi_core2                                                                                                                                     
    25+    1.15%     1.03%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_inits                                                                                                                                           
    26+    1.14%     1.10%  bench_internal  libc-2.29.so      [.] malloc                                                                                                                                                 
    27+    1.03%     0.99%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_clears                                                                                                                                          
    28+    0.89%     0.75%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_realloc                                                                                                                                         
    29+    0.87%     0.00%  bench_internal  [unknown]         [.] 0x3b3d4347494f5359                                                                                                                                     
    30+    0.81%     0.00%  bench_internal  [unknown]         [.] 0000000000000000                                                                                                                                       
    31+    0.73%     0.00%  bench_internal  [unknown]         [.] 0x450b2616e79bd99c                                                                                                                                     
    32+    0.57%     0.00%  bench_internal  [unknown]         [.] 0x34d12f0e848242bb                                                                                                                                     
    33+    0.53%     0.39%  bench_internal  libc-2.29.so      [.] cfree@GLIBC_2.2.5                                                                                                                                      
    34+    0.47%     0.00%  bench_internal  [unknown]         [.] 0x0000000000000021                                                                                                                                     
    35+    0.42%     0.00%  bench_internal  [unknown]         [.] 0x6b87c1525966af45                                                                                                                                     
    36+    0.40%     0.00%  bench_internal  [unknown]         [.] 0x000000000000025c                                                                                                                                     
    37+    0.38%     0.38%  bench_internal  libc-2.29.so      [.] malloc_consolidate                                                                                                                                     
    38+    0.36%     0.24%  bench_internal  libgmp.so.10.3.2  [.] __gmp_default_allocate                                                                                                                                 
    39+    0.36%     0.26%  bench_internal  libgmp.so.10.3.2  [.] __gmp_default_reallocate                                                                                                                               
    40+    0.25%     0.00%  bench_internal  [unknown]         [.] 0x00005609b0ef0010                                                                                                                                     
    41+    0.18%     0.10%  bench_internal  bench_internal    [.] __gmpz_import@plt                                                                                                                                      
    42+    0.16%     0.12%  bench_internal  libc-2.29.so      [.] __memmove_avx_unaligned_erms                                                                                                                           
    43+    0.14%     0.14%  bench_internal  libgmp.so.10.3.2  [.] __gmp_default_free                                                                                                                                     
    44+    0.06%     0.06%  bench_internal  bench_internal    [.] __gmpz_jacobi@plt                                                                                                                                      
    45+    0.06%     0.06%  bench_internal  bench_internal    [.] __gmpz_inits@plt                                                                                                                                       
    46+    0.04%     0.04%  bench_internal  libc-2.29.so      [.] 0x00000000000254c4                                                                                                                                     
    47+    0.02%     0.02%  bench_internal  bench_internal    [.] __gmpz_clears@plt                                                                                                                                      
    48+    0.02%     0.00%  bench_internal  libc-2.29.so      [.] 0x00007f386150f4c0                                                                                                                                     
    49+    0.01%     0.01%  bench_internal  ld-2.29.so        [.] do_lookup_x                                                                                                                                            
    50+    0.01%     0.00%  bench_internal  [unknown]         [.] 0x00007f38617a7c70                                                                                                                                     
    

    without:

     0+   20.14%     0.00%  bench_internal  [unknown]         [.] 0x495641000144033d                                                                                                                                     
     1+   20.14%     0.00%  bench_internal  libc-2.29.so      [.] __libc_start_main                                                                                                                                      
     2+   20.14%     0.00%  bench_internal  bench_internal    [.] main                                                                                                                                                   
     3+   20.14%     0.00%  bench_internal  bench_internal    [.] run_benchmark                                                                                                                                          
     4+   15.52%    15.52%  bench_internal  libc-2.29.so      [.] _int_malloc                                                                                                                                            
     5+   14.86%    14.86%  bench_internal  libc-2.29.so      [.] _int_free                                                                                                                                              
     6+   12.65%    10.90%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_import                                                                                                                                          
     7+   12.65%     0.00%  bench_internal  [unknown]         [.] 0x0000000000000031                                                                                                                                     
     8+   10.55%     9.59%  bench_internal  libc-2.29.so      [.] realloc                                                                                                                                                
     9+    8.48%     8.48%  bench_internal  libc-2.29.so      [.] _int_realloc                                                                                                                                           
    10+    6.11%     6.11%  bench_internal  libc-2.29.so      [.] malloc                                                                                                                                                 
    11+    5.78%     0.00%  bench_internal  [unknown]         [.] 0x3b3d4347494f5359                                                                                                                                     
    12+    5.76%     5.76%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_copyi_core2                                                                                                                                     
    13+    5.29%     4.50%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_inits                                                                                                                                           
    14+    5.29%     4.96%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_realloc                                                                                                                                         
    15+    4.77%     4.60%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_clears                                                                                                                                          
    16+    4.31%     2.54%  bench_internal  libc-2.29.so      [.] cfree@GLIBC_2.2.5                                                                                                                                      
    17+    4.16%     0.00%  bench_internal  [unknown]         [.] 0x5b0274c085480008                                                                                                                                     
    18+    4.16%     0.00%  bench_internal  [unknown]         [.] 0x0406020604020402                                                                                                                                     
    19+    3.94%     0.00%  bench_internal  [unknown]         [.] 0000000000000000                                                                                                                                       
    20+    3.20%     3.20%  bench_internal  bench_internal    [.] bench_num_jacobi                                                                                                                                       
    21+    3.19%     0.00%  bench_internal  [unknown]         [.] 0x0000000000000021                                                                                                                                     
    22+    2.86%     1.77%  bench_internal  libgmp.so.10.3.2  [.] __gmp_default_free                                                                                                                                     
    23+    2.26%     1.94%  bench_internal  libgmp.so.10.3.2  [.] __gmp_default_reallocate                                                                                                                               
    24+    1.75%     1.75%  bench_internal  libc-2.29.so      [.] malloc_consolidate                                                                                                                                     
    25+    1.44%     1.44%  bench_internal  libc-2.29.so      [.] __memmove_avx_unaligned_erms                                                                                                                           
    26+    1.26%     0.79%  bench_internal  libgmp.so.10.3.2  [.] __gmp_default_allocate                                                                                                                                 
    27+    0.96%     0.32%  bench_internal  bench_internal    [.] __gmpz_import@plt                                                                                                                                      
    28+    0.95%     0.17%  bench_internal  bench_internal    [.] __gmpz_clears@plt                                                                                                                                      
    29+    0.91%     0.00%  bench_internal  [unknown]         [.] 0x000055f5626b3010                                                                                                                                     
    30+    0.32%     0.32%  bench_internal  bench_internal    [.] __gmpz_inits@plt                                                                                                                                       
    31+    0.32%     0.32%  bench_internal  libc-2.29.so      [.] 0x00000000000254c0                                                                                                                                     
    32+    0.32%     0.00%  bench_internal  libc-2.29.so      [.] 0x00007f50fca414c4                                                                                                                                     
    33+    0.12%     0.12%  bench_internal  ld-2.29.so        [.] do_lookup_x                                                                                                                                            
    34+    0.12%     0.00%  bench_internal  [unknown]         [.] 0x00007f50fccd9c70                                                                                                                                     
    35+    0.05%     0.05%  bench_internal  ld-2.29.so        [.] strcmp                                                                                                                                                 
    36+    0.05%     0.00%  bench_internal  [unknown]         [.] 0x54495f0030312e6f                                                                                                                                     
    37+    0.01%     0.01%  bench_internal  [unknown]         [k] 0xffffffffb8e00b07                                                                                                                                     
    38+    0.01%     0.00%  bench_internal  ld-2.29.so        [.] _dl_start_user                                                                                                                                         
    39+    0.01%     0.00%  bench_internal  ld-2.29.so        [.] _dl_sysdep_start                                                                                                                                       
    40+    0.00%     0.00%  bench_internal  ld-2.29.so        [.] _start                                                                                                                                                                                                                                                                                                   
    

    It doesn’t seem like it’s doing anything except allocating memory

  5. elichai commented at 8:18 pm on September 30, 2019: contributor
    Looking at the assembly of field_normalize_weak. it generates quite a different instructions which i’m not sure what to make of them. hypothetically it should work the same as we are using the output of the function (as the output for the next call). I think that it might be trying to unroll some of the loop when the escaping isn’t there (it’s harder for it to unroll the loop when there’s stuff there that it doesn’t understand that can do anything it wants to the memory)
  6. in src/bench_internal.c:30 in 2bf75d4b4e outdated
    26@@ -27,6 +27,7 @@ typedef struct {
    27     int wnaf[256];
    28 } bench_inv;
    29 
    30+
    


    jonasnick commented at 12:31 pm on October 7, 2019:
    nit: unnecessary newline?
  7. in src/bench.h:15 in 2bf75d4b4e outdated
    11@@ -12,6 +12,10 @@
    12 #include <math.h>
    13 #include "sys/time.h"
    14 
    15+static void escape(void *p) {
    


    jonasnick commented at 12:32 pm on October 7, 2019:
    nit: Perhaps rename to memory fence/barrier to make this more clear and add a comment what this is for?

    elichai commented at 12:51 pm on October 7, 2019:
    Yeah i’ll rename and add a comment, maybe even an inline hint?
  8. jonasnick commented at 12:41 pm on October 7, 2019: contributor

    I’m getting similar numbers to @elichai. Interestingly with clang the jacobi computation isn’t optimized out on master.

    Unfortunately this PR doesn’t solve the problem completely because the memory barrier as implemented here doesn’t work with all compilers. In particular, it works with gcc and clang but not with Microsoft Visual C afaik. Ideally we would check the results of the individual benchmark computations in bench_teardown similar to how the other bench binaries work. Perhaps bench_internal could even print a note saying that the results are to be taken with a grain of salt if not on gcc/clang.

    Why didn’t you add the memory barrier to context_verify/sign? Doesn’t seem to make a difference though.

  9. elichai commented at 12:50 pm on October 7, 2019: contributor

    Unfortunately this PR doesn’t solve the problem completely because the memory barrier as implemented here doesn’t work with all compilers. In particular, it works with gcc and clang but not with Microsoft Visual C afaik. Ideally we would check the results of the individual benchmark computations in bench_teardown similar to how the other bench binaries work. Perhaps bench_internal could even print a note saying that the results are to be taken with a grain of salt if not on gcc/clang.

    I know almost nothing about windows/Microsoft’s compiler.

    Why didn’t you add the memory barrier to context_verify/sign? Doesn’t seem to make a difference though.

    Wanted to get some feedback before I started going over the rest of the benchmarks, mostly wanted to make sure this doesn’t affect the conclusions for BIP schnorr (I assume the jacobi symbol was tested as part of a full verification so the internal_benchmarks didn’t affect those decisions)

  10. real-or-random commented at 2:34 pm on October 7, 2019: contributor

    Unfortunately this PR doesn’t solve the problem completely because the memory barrier as implemented here doesn’t work with all compilers. In particular, it works with gcc and clang but not with Microsoft Visual C afaik. Ideally we would check the results of the individual benchmark computations in bench_teardown similar to how the other bench binaries work. Perhaps bench_internal could even print a note saying that the results are to be taken with a grain of salt if not on gcc/clang.

    I know almost nothing about windows/Microsoft’s compiler.

    https://github.com/google/benchmark/blob/7d97a057e16597e8020d0aca110480fe82c9ca67/include/benchmark/benchmark.h#L307 and the following lines may help

  11. gmaxwell commented at 1:48 am on October 8, 2019: contributor

    elichai: the internal benchmarks are just that, internal benchmarks… no one should be using them for systems comparisons; this isn’t an issue for bip schnorr.

    I don’t like this approach, it’s very compiler specific and doesn’t obviously preclude all sufficiently advanced optimizations– and some platforms the memory barrier might do something stupidly expensive like flush the TLBs or something. It’s better to set things up so that that data gets used.

    I’ve sometimes seen microbenchmarks print a final value, which can sometimes help catch other bugs or changes. (e.g. you make some optimizations that shouldn’t change behaviour and the final value changes). If that was required in order to get the results to print, it wouldn’t be the end of the word.

    For Jacobi simply carrying a running sum of the results would probably be fine.

    As a side, the jacobi symbol in this code is currently GMP only, the QR test without GMP (e.g. as bitcoin ships) is done with a grievously slow constant time exponentiation ladder. … that should probably be someone’s priority, even if it’s not an outright deployment blocker.

  12. elichai commented at 1:54 am on October 8, 2019: contributor

    @gmaxwell I assumed so, just wanted to mention it just in case (bip schnorr doesn’t specify a specific benchmark)

    I’m not sure about the TLB thing because this memory barrier results in 0 more instructions. It’s just to make the optimizer not optimize the result out (and by that also maybe the whole call graph)

    About a different approach I can try and rewrite those benchmarks so that they all use the output from the function they’re benchmarking and in the end of the loop assert on it in some way (so that it won’t optimize from the end)

  13. gmaxwell commented at 5:38 pm on October 8, 2019: contributor
    Aside, http://sape.inf.usi.ch/publications/asplos09 should be required reading for anyone attempting to use microbenchmarks like these. (and perhaps https://github.com/ccurtsinger/stabilizer )
  14. Add memory fence to bench_internal f423d2e14d
  15. elichai force-pushed on Oct 21, 2019
  16. elichai commented at 9:23 pm on October 21, 2019: contributor

    Ok, I tried to remove as much fences as possible and use simple logic to do the same. in functions that the inputs and outputs were the same I added one fence at the end instead of every iteration (because at the end of the iterations the output isn’t used). Also added an __always_inline__ to the fence function. if people prefer to just print the whole data structure at the end of the benchmarks this might allow us to remove most of the fences.

    Now the stats look like this:

     0scalar_add: min 0.00757us / avg 0.00800us / max 0.00820us
     1scalar_negate: min 0.00288us / avg 0.00290us / max 0.00293us
     2scalar_sqr: min 0.0303us / avg 0.0307us / max 0.0312us
     3scalar_mul: min 0.0312us / avg 0.0316us / max 0.0321us
     4scalar_inverse: min 8.80us / avg 8.95us / max 9.07us
     5scalar_inverse_var: min 2.10us / avg 2.15us / max 2.31us
     6field_normalize: min 0.00745us / avg 0.00772us / max 0.00846us
     7field_normalize_weak: min 0.00299us / avg 0.00303us / max 0.00313us
     8field_sqr: min 0.0145us / avg 0.0149us / max 0.0153us
     9field_mul: min 0.0185us / avg 0.0189us / max 0.0193us
    10field_inverse: min 4.18us / avg 4.21us / max 4.26us
    11field_inverse_var: min 2.11us / avg 2.15us / max 2.18us
    12field_sqrt: min 4.07us / avg 4.10us / max 4.16us
    13group_double_var: min 0.124us / avg 0.127us / max 0.131us
    14group_add_var: min 0.290us / avg 0.292us / max 0.298us
    15group_add_affine: min 0.244us / avg 0.248us / max 0.252us
    16group_add_affine_var: min 0.203us / avg 0.206us / max 0.209us
    17group_jacobi_var: min 0.752us / avg 0.759us / max 0.797us
    18wnaf_const: min 0.116us / avg 0.120us / max 0.122us
    19ecmult_wnaf: min 0.443us / avg 0.453us / max 0.467us
    20hash_sha256: min 0.249us / avg 0.251us / max 0.255us
    21hash_hmac_sha256: min 1.01us / avg 1.02us / max 1.05us
    22hash_rfc6979_hmac_sha256: min 5.56us / avg 5.59us / max 5.67us
    23context_verify: min 2635us / avg 2646us / max 2660us
    24context_sign: min 27.3us / avg 27.9us / max 28.1us
    25num_jacobi: min 0.631us / avg 0.637us / max 0.646us
    26  Time (mean ± σ):      8.334 s ±  0.059 s    [User: 8.327 s, System: 0.000 s]
    27  Range (min  max):    8.185 s   8.387 s    10 runs
    
  17. in src/bench_internal.c:106 in 79dbc5628c outdated
    105     for (i = 0; i < 20000; i++) {
    106-        secp256k1_scalar l, r;
    107-        secp256k1_scalar_split_lambda(&l, &r, &data->scalar_x);
    108-        secp256k1_scalar_add(&data->scalar_x, &data->scalar_x, &data->scalar_y);
    109+        secp256k1_scalar_split_lambda(&data->scalar_x, &data->scalar_y, &data->scalar_x);
    110+        j += secp256k1_scalar_add(&data->scalar_x, &data->scalar_x, &data->scalar_y);
    


    elichai commented at 9:28 pm on October 21, 2019:

    I think this should be fine because it solves for x=a-y*lambda and y=(a-x)/lambda so adding them up again every time should not result back in the same a and should give a different result.

    But please tell me if there’s something here I’m missing(maybe it will make it go to zero very quickly? (though I don’t think so because this is all in Fp so it overflows) but if it does I can replace the addition with multiplication if it’s any better)

  18. Remove memory fences where we can do better a5198c52ba
  19. elichai force-pushed on Oct 21, 2019
  20. apoelstra commented at 9:08 pm on October 22, 2019: contributor

    Worth revisiting #290 in light of this PR.

    Perhaps my frustration at GMP being impossibly faster than my own implementation was just an illusion?

  21. elichai commented at 9:19 pm on October 22, 2019: contributor
    Maybe. I can try to rebase and see how the benchmarks compare (actually your implementation has side effect. so it makes sense (you use the CHECK macro in some of the functions))
  22. apoelstra commented at 9:34 pm on October 22, 2019: contributor
    I would be extremely curious to see how my implementation compares, assuming that a rebase is feasible. (The bulk of the code is in the new num_* files as I recall, but the integration with the rest of the library may need to be rewritten at this point..)
  23. elichai commented at 9:43 pm on October 22, 2019: contributor
    Will give it a try tomorrow. If the speed does come close it would be pretty awesome for the future schnorr verification (we need it to at least be better than secp256k1_fe_sqrt)
  24. elichai commented at 1:37 pm on October 23, 2019: contributor

    @apoelstra https://github.com/elichai/secp256k1/tree/rebased-jacobi Used hyperfine -w 5 -m 20 --show-output './bench_internal' yours 64bit:

     0Benchmark [#1](/bitcoin-core-secp256k1/1/): ./bench_internal
     1num_jacobi: min 4.12us / avg 4.16us / max 4.20us
     2num_jacobi: min 3.99us / avg 4.08us / max 4.10us
     3num_jacobi: min 4.11us / avg 4.12us / max 4.13us
     4num_jacobi: min 4.27us / avg 4.30us / max 4.36us
     5num_jacobi: min 4.18us / avg 4.19us / max 4.21us
     6num_jacobi: min 4.18us / avg 4.19us / max 4.19us
     7num_jacobi: min 4.28us / avg 4.33us / max 4.58us
     8num_jacobi: min 4.33us / avg 4.47us / max 4.62us
     9num_jacobi: min 4.26us / avg 4.32us / max 4.54us
    10num_jacobi: min 4.37us / avg 4.39us / max 4.41us
    11num_jacobi: min 4.15us / avg 4.30us / max 4.43us
    12num_jacobi: min 4.22us / avg 4.38us / max 4.56us
    13num_jacobi: min 4.34us / avg 4.46us / max 4.52us
    14num_jacobi: min 4.35us / avg 4.37us / max 4.39us
    15num_jacobi: min 4.17us / avg 4.23us / max 4.42us
    16num_jacobi: min 4.15us / avg 4.16us / max 4.18us
    17num_jacobi: min 4.09us / avg 4.13us / max 4.21us
    18num_jacobi: min 4.22us / avg 4.25us / max 4.27us
    19num_jacobi: min 4.17us / avg 4.22us / max 4.26us
    20num_jacobi: min 4.19us / avg 4.23us / max 4.28us
    21num_jacobi: min 4.19us / avg 4.22us / max 4.29us
    22num_jacobi: min 4.18us / avg 4.20us / max 4.26us
    23num_jacobi: min 4.18us / avg 4.21us / max 4.30us
    24num_jacobi: min 4.17us / avg 4.19us / max 4.22us
    25num_jacobi: min 4.17us / avg 4.18us / max 4.19us
    26  Time (mean ± σ):      8.542 s ±  0.199 s    [User: 8.533 s, System: 0.001 s]
    27  Range (min … max):    8.251 s …  8.934 s    20 runs
    

    GMP:

     0Benchmark [#1](/bitcoin-core-secp256k1/1/): ./bench_internal
     1num_jacobi: min 0.613us / avg 0.628us / max 0.636us
     2num_jacobi: min 0.600us / avg 0.612us / max 0.624us
     3num_jacobi: min 0.609us / avg 0.641us / max 0.708us
     4num_jacobi: min 0.644us / avg 0.676us / max 0.730us
     5num_jacobi: min 0.658us / avg 0.670us / max 0.705us
     6num_jacobi: min 0.627us / avg 0.635us / max 0.648us
     7num_jacobi: min 0.647us / avg 0.677us / max 0.739us
     8num_jacobi: min 0.657us / avg 0.679us / max 0.691us
     9num_jacobi: min 0.647us / avg 0.673us / max 0.700us
    10num_jacobi: min 0.638us / avg 0.680us / max 0.718us
    11num_jacobi: min 0.644us / avg 0.702us / max 0.784us
    12num_jacobi: min 0.647us / avg 0.713us / max 0.780us
    13num_jacobi: min 0.640us / avg 0.699us / max 0.792us
    14num_jacobi: min 0.647us / avg 0.689us / max 0.749us
    15num_jacobi: min 0.684us / avg 0.742us / max 0.819us
    16num_jacobi: min 0.672us / avg 0.691us / max 0.711us
    17num_jacobi: min 0.639us / avg 0.752us / max 0.806us
    18num_jacobi: min 0.679us / avg 0.719us / max 0.788us
    19num_jacobi: min 0.673us / avg 0.698us / max 0.749us
    20num_jacobi: min 0.668us / avg 0.684us / max 0.722us
    21num_jacobi: min 0.668us / avg 0.686us / max 0.724us
    22num_jacobi: min 0.639us / avg 0.652us / max 0.665us
    23num_jacobi: min 0.648us / avg 0.663us / max 0.706us
    24num_jacobi: min 0.649us / avg 0.658us / max 0.671us
    25num_jacobi: min 0.649us / avg 0.670us / max 0.698us
    26  Time (mean ± σ):      1.377 s ±  0.058 s    [User: 1.374 s, System: 0.000 s]
    27  Range (min … max):    1.270 s …  1.506 s    20 runs
    
  25. elichai cross-referenced this on Oct 23, 2019 from issue Add native num implementation; modular inverse and Jacobi symbol without GMP by apoelstra
  26. apoelstra commented at 9:07 pm on October 23, 2019: contributor
    Ok, 8x is better than 50x I suppose :)
  27. jonasnick commented at 10:31 am on October 28, 2019: contributor
    Before progress on this PR stalls because of the the memory fences, how about we add just the num_jacobi fix for now (can be separate PR)?
  28. real-or-random commented at 10:38 am on October 28, 2019: contributor

    Before progress on this PR stalls because of the the memory fences, how about we add just the num_jacobi fix for now (can be separate PR)?

    I agree. Let’s get this part merged and we can still have a discussion about memory fences in another PR maybe or on IRC.

  29. elichai cross-referenced this on Oct 28, 2019 from issue Preventing compiler optimizations in benchmarks without a memory fence by elichai
  30. elichai commented at 2:31 pm on October 28, 2019: contributor
  31. jonasnick referenced this in commit 544002c008 on Nov 18, 2019
  32. real-or-random commented at 10:29 am on November 25, 2019: contributor

    Now after #678 has been merged, this here is mostly related to memory fences. Let me note that my plan was to add a fence in #636 anyway for clearing memory ( with a #ifdef though).

    By the way, the fence here is:

    0     __asm__ __volatile__("": : "g"(p) : "memory"); 
    

    whereas the one in #636 (that I also added to Bitcoin Core in bitcoin/bitcoin#16158) is

    0    __asm__ __volatile__("" : : "r"(p) : "memory");
    

    As far as I can tell from the docs, those should be equivalent but I’m really not an expert here.

  33. elichai cross-referenced this on Mar 1, 2020 from issue Implement go bindings to secp256k1 by elichai
  34. jonasnick cross-referenced this on Mar 24, 2020 from issue Consider removing x86_64 field assembly by jonasnick
  35. real-or-random commented at 2:52 pm on April 27, 2020: contributor

    As far as I can tell from the docs, those should be equivalent but I’m really not an expert here.

    It seems that the g version is slighter cleaner because it does not require the compiler to allocate a register for a ASM noop, see https://bugs.llvm.org/show_bug.cgi?id=15495#c11.

    edit: actually not sure in practice. I guess the pointer is usually in a register anywhere, and r/g doesn’t make a difference then for GCC. LLVM will always pass the pointer via memory if I use g, so this will always need a useless memory access. For this PR, this is anyway not important but I would like to be careful in #636. I think keeping the r variant is better, also because this seems more commonly used, e.g., in the Linux kernel and in Core. So the change is higher that someone will notice if this breaks in the future.

  36. sipa commented at 3:11 pm on May 8, 2023: contributor

    I’m not sure there is an issue anymore; the jacobi benchmark was bad, but it has been fixed (and removed, and re-added, …).

    For added confidence, I think there are possibilities to avoiding the compiler optimizing some things away that don’t actually need a memory fence (e.g. printing a value that’s a function of all benchmark’s success counters, or comparing success counters after reading through volatile like CHECK(*((volatile int*)&count) <= iters);).

  37. sipa closed this on May 8, 2023


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-23 22:15 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me