As asked in #667 (comment), these are the parts of #667 that don’t require an assembly memory fence.
I split them into two commits: one with the obvious, easy ones, and another that changes the logic a bit to achieve this (see #667 (review)).
+    CHECK(j == 20000);
 }

 void bench_ecmult_wnaf(void* arg) {
-    int i;
+    int i, bits = 0, overflow = 0;
@@ -182,15 +185,16 @@ void bench_field_inverse_var(void* arg) {
 }

 void bench_field_sqrt(void* arg) {
-    int i;
+    int i, j = 0;
-        secp256k1_scalar_split_lambda(&l, &r, &data->scalar_x);
-        secp256k1_scalar_add(&data->scalar_x, &data->scalar_x, &data->scalar_y);
+        secp256k1_scalar_split_lambda(&data->scalar_x, &data->scalar_y, &data->scalar_x);
+        j += secp256k1_scalar_add(&data->scalar_x, &data->scalar_x, &data->scalar_y);
     }
+    CHECK(j <= 2000);
<= 20000 (i.e. it’s missing a 0)?
The code changes in 362bb25608dbcd724a07dd5170c4ebe081c3dd84 look good to me.
This fixes the two jacobi benchmarks on gcc, but for some reason it makes some of the benchmarks noticeably faster (a few percent) in clang on my machine; see the spreadsheet here: https://gist.github.com/real-or-random/68cf2e2ba7d9ab1fc6a50a7e3698fc8f/raw/e13e2485628016976cedc2ad58180050ddc6c071/362bb25.ods Any idea why this is the case? Can you reproduce it?
Here are my stats: https://gist.github.com/elichai/7cc8ac137c83565768d87387efae96bf
It seems like clang didn’t optimize this away from the beginning. I’m not sure why it’s faster, but on my machine it’s only by 8us.
Anyway, this is certainly an improvement.
ACK 362bb25608dbcd724a07dd5170c4ebe081c3dd84. I read the diff and ran the benchmarks.