ecmult_multi: Replace scratch space with malloc, use abcd cost model #1789

fjahr wants to merge 2 commits into bitcoin-core:master from fjahr:2025-11-mem-multi-var, changing 7 files (+903 −837)
  1. fjahr commented at 11:47 PM on December 15, 2025: contributor

    This is an implementation of the discussed changes from an in-person meeting in October. It removes usage of scratch space in batch validation and replaces it with internal malloc usage. It also adds an ABCD cost model for algorithm selection.

    The API and internals follow the drafted spec from the meeting very closely: https://gist.github.com/fjahr/c2a009487dffe7a1fbf17ca1821976ca There are a few minor changes that should not affect the intended behavior. Test coverage is currently a bit lower than it was previously; I am guessing an adapted form of the test_ecmult_multi test should be added back, and there are some TODOs left in the test code which I plan to address after a first round of conceptual feedback.

    The second commit demonstrates the calibration tooling I have been using, though it's not exactly what produced the results that are in the PR. I have been struggling with the calibration code and never seem to get a result that just works without manual tweaking, so the calibration itself, as well as the code added for it, is rather a work in progress. I am assuming some version of the calibration code should be added to the repo, but I haven't thought much about the best place for it. Putting it into the benchmark and combining it with the python script was just a convenient way to experiment. I am very open to suggestions on how to change this.

  2. fjahr force-pushed on Dec 16, 2025
  3. in src/modules/musig/keyagg_impl.h:152 in 8760b3c148 outdated
     163 | -#endif
     164 | -    secp256k1_musig_keyaggcoef_internal(sc, ctx->pks_hash, pt, &ctx->second_pk);
     165 | -    return 1;
     166 | -}
     167 | -
     168 |  int secp256k1_musig_pubkey_agg(const secp256k1_context* ctx, secp256k1_xonly_pubkey *agg_pk, secp256k1_musig_keyagg_cache *keyagg_cache, const secp256k1_pubkey * const* pubkeys, size_t n_pubkeys) {
    


    siv2r commented at 1:33 PM on December 26, 2025:

    It might be better to pass mem_limit as an input argument rather than defining it internally


    fjahr commented at 4:26 PM on January 20, 2026:

    Hm, I see the point that this would align with the batch validation API that has been discussed, but I am not sure that justifies changing the existing API of musig. It would probably be better to do this in a separate/follow-up PR, but I will wait a bit to see what other reviewers think about this.

  4. in src/modules/musig/keyagg_impl.h:206 in 8760b3c148
     207 | +        secp256k1_musig_keyaggcoef_internal(&scalars[i], ecmult_data.pks_hash, &points[i], &ecmult_data.second_pk);
     208 | +    }
     209 | +
     210 | +    /* TODO: Previously this just used TRIVIAL algorithm by not providing
     211 | +     * scratch space. To be discussed if this should be changed. */
     212 | +    mem_limit = n_pubkeys * (sizeof(secp256k1_ge) + sizeof(secp256k1_scalar)) * 4;
    


    siv2r commented at 1:51 PM on December 26, 2025:

    nit: Why the constant 4? I assume it covers internal overhead. I'm wondering whether we can define a ratio k such that mem_limit = n_points * (sizeof(secp256k1_ge) + sizeof(secp256k1_scalar)) * k yields a batch size of n_points when given to _ecmult_multi_batch_size()


    siv2r commented at 1:57 PM on December 26, 2025:

    Hmm, we could probably do mem_limit = max(STRAUSS_POINT_SIZE * n, PIPPENGER_POINT_SIZE * n + PIPPENGER_FIXED_SIZE). Just thinking out loud, not suggesting a change here.


    fjahr commented at 4:26 PM on January 20, 2026:

    This was just a rough heuristic that I didn't put much thought into because of the TODO above, which I am still unsure about. Your suggestions are interesting and I think we could actually just do mem_limit = SECP256K1_STRAUSS_POINT_SIZE * n_pubkeys; because in 99.9% of musig cases Strauss should be the optimal algorithm anyway, and this makes the reasoning in the code easier as well. Let me know what you think; I have updated the code with this for now.

  5. in src/tests_exhaustive.c:187 in 8760b3c148
     185 | @@ -186,17 +186,13 @@ typedef struct {
     186 |      secp256k1_ge pt[2];
     187 |  } ecmult_multi_data;
    


    siv2r commented at 2:02 PM on December 26, 2025:

    We can probably remove this data structure. In the old code, I only see this acting as cbdata (initialized using local scalars and points arrays) to the ecmult_multi_var


    fjahr commented at 4:26 PM on January 20, 2026:

    Oh, true, good catch!

  6. in src/bench_ecmult.c:339 in 8760b3c148
     332 | @@ -306,11 +333,11 @@ static void run_ecmult_multi_bench(bench_data* data, size_t count, int includes_
     333 |  int main(int argc, char **argv) {
     334 |      bench_data data;
     335 |      int i, p;
     336 | -    size_t scratch_size;
     337 |  
     338 |      int iters = get_iters(10000);
     339 |  
     340 | -    data.ecmult_multi = secp256k1_ecmult_multi_var;
     341 | +    data.forced_algo = -1;
    


    siv2r commented at 2:09 PM on December 26, 2025:

    nit: We could define #define BENCH_ALGO_AUTO (-1) to make this more readable. Then, the check on line 252 would be: if (data->forced_algo != BENCH_ALGO_AUTO)


    fjahr commented at 4:26 PM on January 20, 2026:

    Sounds good, taken.

  7. in src/ecmult_impl.h:586 in 8760b3c148 outdated
     636 | +
     637 | +/* Pippenger fixed overhead: buckets + state + 2 extra entries */
     638 | +#define SECP256K1_PIPPENGER_FIXED_SIZE(w) \
     639 | +    ((sizeof(secp256k1_gej) << (w)) \
     640 | +     + sizeof(struct secp256k1_pippenger_state) \
     641 | +     + 2 * SECP256K1_PIPPENGER_ENTRY_SIZE(w))
    


    siv2r commented at 2:15 PM on December 26, 2025:
    #define SECP256K1_PIPPENGER_FIXED_SIZE(w) \
        ((sizeof(secp256k1_gej) << (w)) \
         + 2 * SECP256K1_PIPPENGER_ENTRY_SIZE(w))
    

    We can remove the pippenger_state since it's a stack variable. IIUC, the old code allocated this variable on the heap, hence it included it in the size calculation.


    fjahr commented at 4:26 PM on January 20, 2026:

    That looks correct to me, taken!

  8. in src/ecmult_impl.h:732 in 8760b3c148
     858 | +            n * ECMULT_TABLE_SIZE(WINDOW_A) * sizeof(secp256k1_ge));
     859 | +        state.ps = (struct secp256k1_strauss_point_state *)checked_malloc(error_callback,
     860 | +            n * sizeof(struct secp256k1_strauss_point_state));
     861 | +        points_gej = (secp256k1_gej *)checked_malloc(error_callback,
     862 | +            n * sizeof(secp256k1_gej));
     863 | +        scalars_copy = (secp256k1_scalar *)checked_malloc(error_callback,
    


    siv2r commented at 2:22 PM on December 26, 2025:

    Is there a specific reason for copying the scalars here? Since they are managed by the upper layer (like batch verification or MuSig), could we just pass the pointer to _strauss_wnaf directly?


    fjahr commented at 4:27 PM on January 20, 2026:

    I think you are correct, I have removed it.

  9. in src/ecmult_impl.h:750 in 8760b3c148
     876 | +        secp256k1_ecmult_strauss_wnaf(&state, r, n, points_gej, scalars_copy, scalar_g);
     877 | +    } else {
     878 | +        secp256k1_gej tmpj;
     879 | +        secp256k1_gej_set_infinity(&tmpj);
     880 | +        secp256k1_ecmult(r, &tmpj, &secp256k1_scalar_zero, scalar_g);
     881 | +    }
    


    siv2r commented at 2:25 PM on December 26, 2025:

    We can probably remove this else block. The n=0 case is already handled by the call to ecmult_strauss_wnaf in the preceding if block. Plus, secp256k1_ecmult calls strauss_wnaf internally, so this seems redundant.


    fjahr commented at 4:27 PM on January 20, 2026:

    You are correct! Removing it.

  10. in src/ecmult_impl.h:598 in 8760b3c148 outdated
     648 | + * Index 2-13 = PIPPENGER with window sizes 1-12
     649 | + */
     650 | +static const struct secp256k1_ecmult_multi_abcd secp256k1_ecmult_multi_abcds[SECP256K1_ECMULT_MULTI_NUM_ALGOS] = {
     651 | +/*  A (per-point bytes)                     B (fixed bytes)                        C        D           */
     652 | +    {0,                                     0,                                     1000,    0      },
     653 | +    {SECP256K1_STRAUSS_POINT_SIZE,          0,                                     100,     0      },
    


    siv2r commented at 9:00 AM on December 27, 2025:

    D shouldn't be 0 here, right?


    fjahr commented at 5:23 PM on January 25, 2026:

    Yeah, I saw some wild jumps in the D value and just set it to zero manually because it didn't seem to matter much. I am pretty sure I wrote a comment at the time but then probably it got lost when I updated the values again at some point later.

  11. in src/bench_ecmult.c:76 in 8b62524b0f outdated
      71 | + * Measures the performance of each algorithm at various batch sizes and
      72 | + * outputs. Use tools/ecmult_multi_calib.py to calculate optimal C and D
      73 | + * values from the output.
      74 | + */
      75 | +static void run_ecmult_multi_calib(bench_data* data) {
      76 | +    static const size_t batch_sizes[] = {10, 20, 30, 50, 75, 100, 150, 200, 300, 500, 750, 1000, 1500, 2000, 3000, 5000, 7500, 10000, 15000, 20000, 30000};
    


    siv2r commented at 9:17 AM on December 27, 2025:

    I think it would be nice to have more batch sizes in the crossover region of Strauss and Pippenger (~80-150 points). So, something like this:

        static const size_t batch_sizes[] = {
            /* Small (Strauss region) */
            5, 10, 15, 20, 30, 50, 70,
            /* Crossover region */
            85, 88, 90, 100, 120, 150, 175,
            /* Medium (Pippenger small windows, w=6..8) */
            200, 300, 500, 750, 1000, 1200,
            /* Large (Pippenger large windows, w=9..12) */
            1500, 2000, 3000, 5000, 7500,
            10000, 15000, 20000, 30000
        };
    

    siv2r commented at 9:26 AM on December 27, 2025:

    We should probably cap the batch size for Strauss calibration (7,500 or less?). At higher sizes, the benchmark results curve and become outliers, which will skew the linear regression model. Since ecmult_multi_select won't choose Strauss for large batches anyway, we can avoid these sizes. I've attached a visualization below.

    <img width="640" height="480" alt="STRAUSS" src="https://github.com/user-attachments/assets/b251b0c9-1323-4e6e-8f2f-eee6246c7e26" />


    siv2r commented at 6:42 PM on March 26, 2026:

    <img width="640" height="480" alt="STRAUSS" src="https://github.com/user-attachments/assets/5945dd94-ef36-4752-96a6-8e16322b1c23" />

    The Strauss fit improved. More info: #1789 (comment)

  12. siv2r commented at 9:38 AM on December 27, 2025: contributor

    I ran the calibration tool, and I'm getting the following results. They're very close to the existing results up until PIPPENGER_8, after which the C value increases for me (this shouldn't happen, right?). I'll run this a few more times.

    <details> <summary>My ABCD values</summary>

    static const struct secp256k1_ecmult_multi_abcd secp256k1_ecmult_multi_abcds[SECP256K1_ECMULT_MULTI_NUM_ALGOS] = {
        {0,                                   0,                                   1000,  0    },
        {SECP256K1_STRAUSS_POINT_SIZE,        0,                                   113,   143  },
        {SECP256K1_PIPPENGER_POINT_SIZE(1),   SECP256K1_PIPPENGER_FIXED_SIZE(1),   199,   303  },
        {SECP256K1_PIPPENGER_POINT_SIZE(2),   SECP256K1_PIPPENGER_FIXED_SIZE(2),   152,   460  },
        {SECP256K1_PIPPENGER_POINT_SIZE(3),   SECP256K1_PIPPENGER_FIXED_SIZE(3),   117,   782  },
        {SECP256K1_PIPPENGER_POINT_SIZE(4),   SECP256K1_PIPPENGER_FIXED_SIZE(4),   100,   1158 },
        {SECP256K1_PIPPENGER_POINT_SIZE(5),   SECP256K1_PIPPENGER_FIXED_SIZE(5),   86,    1837 },
        {SECP256K1_PIPPENGER_POINT_SIZE(6),   SECP256K1_PIPPENGER_FIXED_SIZE(6),   77,    3013 },
        {SECP256K1_PIPPENGER_POINT_SIZE(7),   SECP256K1_PIPPENGER_FIXED_SIZE(7),   72,    4845 },
        {SECP256K1_PIPPENGER_POINT_SIZE(8),   SECP256K1_PIPPENGER_FIXED_SIZE(8),   69,    8775 },
        {SECP256K1_PIPPENGER_POINT_SIZE(9),   SECP256K1_PIPPENGER_FIXED_SIZE(9),   73,    14373},
        {SECP256K1_PIPPENGER_POINT_SIZE(10),  SECP256K1_PIPPENGER_FIXED_SIZE(10),  78,    26442},
        {SECP256K1_PIPPENGER_POINT_SIZE(11),  SECP256K1_PIPPENGER_FIXED_SIZE(11),  80,    48783},
        {SECP256K1_PIPPENGER_POINT_SIZE(12),  SECP256K1_PIPPENGER_FIXED_SIZE(12),  106,   88289},
    };
    

    </details>

    I’ve visualized the benchmarks for all algorithms (see https://github.com/siv2r/secp256k1/commit/311938612268866bd3dd0cf6c1d006e418518b1d for the diagrams). I’m planning to do a deeper dive into the regression model fit to ensure they align with these results. Will share any useful findings here

  13. siv2r commented at 2:11 PM on December 27, 2025: contributor

    I did some analysis on the linear regression model for the CD values (see https://github.com/siv2r/secp256k1/commit/b5985a60454e68c7f88bfe7a9d41b89a47f2b530). I basically ran ./bench_ecmult calib twice. Used the first run to compute the linear regression model, and the second run to test it against new benchmark values.

    <details> <summary>ABCD Values</summary>

    static const struct secp256k1_ecmult_multi_abcd secp256k1_ecmult_multi_abcds[SECP256K1_ECMULT_MULTI_NUM_ALGOS] = {
        {0,                                   0,                                   1000,  0    },
        {SECP256K1_STRAUSS_POINT_SIZE,        0,                                   112,   173  },
        {SECP256K1_PIPPENGER_POINT_SIZE(1),   SECP256K1_PIPPENGER_FIXED_SIZE(1),   197,   285  },
        {SECP256K1_PIPPENGER_POINT_SIZE(2),   SECP256K1_PIPPENGER_FIXED_SIZE(2),   152,   480  },
        {SECP256K1_PIPPENGER_POINT_SIZE(3),   SECP256K1_PIPPENGER_FIXED_SIZE(3),   117,   767  },
        {SECP256K1_PIPPENGER_POINT_SIZE(4),   SECP256K1_PIPPENGER_FIXED_SIZE(4),   100,   1167 },
        {SECP256K1_PIPPENGER_POINT_SIZE(5),   SECP256K1_PIPPENGER_FIXED_SIZE(5),   86,    1887 },
        {SECP256K1_PIPPENGER_POINT_SIZE(6),   SECP256K1_PIPPENGER_FIXED_SIZE(6),   78,    3023 },
        {SECP256K1_PIPPENGER_POINT_SIZE(7),   SECP256K1_PIPPENGER_FIXED_SIZE(7),   73,    4906 },
        {SECP256K1_PIPPENGER_POINT_SIZE(8),   SECP256K1_PIPPENGER_FIXED_SIZE(8),   70,    8889 },
        {SECP256K1_PIPPENGER_POINT_SIZE(9),   SECP256K1_PIPPENGER_FIXED_SIZE(9),   74,    14544},
        {SECP256K1_PIPPENGER_POINT_SIZE(10),  SECP256K1_PIPPENGER_FIXED_SIZE(10),  79,    26764},
        {SECP256K1_PIPPENGER_POINT_SIZE(11),  SECP256K1_PIPPENGER_FIXED_SIZE(11),  84,    49179},
        {SECP256K1_PIPPENGER_POINT_SIZE(12),  SECP256K1_PIPPENGER_FIXED_SIZE(12),  106,   89518},
    };
    

    </details>

    <details> <summary>Statistical analysis </summary>

    Metric Definitions

    | Metric | Meaning | Good Value for Benchmarking |
    |---|---|---|
    | $R^2$ | How well the linear model fits (0-1) | > 0.95 (strong linear relationship) |
    | Std Error | Uncertainty in slope estimate | Low relative to slope |
    | p-value | Probability slope = 0 by chance | < 0.05 (statistically significant) |
    | Slope | Time increase per additional n (μs) | Positive, algorithm-dependent |
    | Intercept | Fixed overhead time (μs) | Algorithm-dependent |

    Per-Algorithm Metrics

    | Algorithm | $R^2$ | Std Error | p-value | Slope | Intercept |
    |---|---|---|---|---|---|
    | STRAUSS | 0.983536 | 0.1660 | 1.29e-25 | 6.2956 | -1917.23 |
    | PIPPENGER_1 | 0.999995 | 0.0039 | 6.29e-73 | 8.5860 | -133.20 |
    | PIPPENGER_2 | 0.999992 | 0.0036 | 9.98e-71 | 6.7066 | -115.04 |
    | PIPPENGER_3 | 0.999993 | 0.0026 | 8.15e-71 | 4.9746 | -34.11 |
    | PIPPENGER_4 | 0.999996 | 0.0015 | 5.34e-75 | 4.1772 | -13.27 |
    | PIPPENGER_5 | 0.999995 | 0.0015 | 1.16e-73 | 3.5473 | 56.56 |
    | PIPPENGER_6 | 0.999991 | 0.0019 | 2.30e-69 | 3.1035 | 105.03 |
    | PIPPENGER_7 | 0.999994 | 0.0013 | 4.69e-72 | 2.7083 | 219.62 |
    | PIPPENGER_8 | 0.999989 | 0.0016 | 2.64e-68 | 2.4879 | 414.58 |
    | PIPPENGER_9 | 0.999977 | 0.0021 | 3.76e-64 | 2.2872 | 719.46 |
    | PIPPENGER_10 | 0.999953 | 0.0028 | 5.02e-60 | 2.1705 | 1287.92 |
    | PIPPENGER_11 | 0.999847 | 0.0049 | 4.73e-53 | 2.0592 | 2321.53 |
    | PIPPENGER_12 | 0.999583 | 0.0075 | 3.62e-47 | 1.9331 | 4153.43 |

    </details>

    The linear fit ($R^2$) and standard error for STRAUSS are relatively worse. We can improve this by using smaller batch sizes when calibrating Strauss. I see negative intercepts (i.e., D values) for PIPPENGER_1..4; these should be positive. I think this could also be addressed by capping these algorithms to medium batch sizes during calibration. Otherwise, the calibrated values are excellent!

  14. fjahr force-pushed on Jan 20, 2026
  15. fjahr force-pushed on Jan 20, 2026
  16. fjahr commented at 4:29 PM on January 20, 2026: contributor

    Addressed the comments on the implementation code and rebased, thanks a lot for the review @siv2r ! I need a little more time to think about the calibration again, but will comment on that asap.

  17. fjahr force-pushed on Jan 25, 2026
  18. fjahr commented at 5:22 PM on January 25, 2026: contributor

    @siv2r Thanks again for the review! The hint about using narrower ranges for each of the algos in the calibration was awesome; I didn't think of this before, and it pretty much fixed all the problems I ran into with the calibration results. I took these suggestions with some smaller additional tweaks, specifically for Strauss, which still had quite a bit of variance. I now see pretty stable calibration results between multiple runs, and the measurements across the different benchmarks run more smoothly than in any of my previous iterations.

    These improvements have given me a lot more confidence in these changes, so taking this out of draft status too.

  19. fjahr marked this as ready for review on Jan 25, 2026
  20. fjahr force-pushed on Feb 3, 2026
  21. fjahr commented at 10:37 AM on February 3, 2026: contributor

    Randomly came across #638 when looking at older PRs and, comparing the overlapping code, found a small bug in my pippenger size calculation.

  22. fjahr force-pushed on Feb 3, 2026
  23. in src/ecmult.h:79 in 4b059914c4
      84 | + * For each algorithm, memory usage is modeled as m(x) = A*x + B and
      85 | + * running time as c(x) = C*x + D, where x is the batch size. This
      86 | + * function finds the algorithm that minimizes time per operation
      87 | + * C + D/x at the maximum batch size x = (mem_limit - B) / A.
      88 | + *
      89 | + * Returns: The optimal batch size, or 0 if memory is insufficient.
    


    siv2r commented at 4:33 PM on March 23, 2026:

    This function doesn't return 0 when memory is insufficient. It returns ECMULT_MAX_POINTS_PER_BATCH. We need to update the docs here. Something like:

    The optimal batch size. Falls back to ECMULT_MAX_POINTS_PER_BATCH 
    using the trivial algorithm when no Strauss/Pippenger algorithm fits.
    

    fjahr commented at 10:34 AM on March 24, 2026:

    Thanks, I fixed it and took your suggestion.

  24. fjahr force-pushed on Mar 24, 2026
  25. fjahr commented at 10:35 AM on March 24, 2026: contributor

    Addressed @siv2r's feedback and rebased; I had kind of lost track of this, sorry.

  26. fjahr force-pushed on Mar 24, 2026
  27. in src/ecmult_impl.h:848 in 5050b5e097
    1006 | +    const secp256k1_scalar *scalars,
    1007 | +    const secp256k1_scalar *scalar_g,
    1008 | +    size_t mem_limit
    1009 | +) {
    1010 | +    secp256k1_ecmult_multi_algo algo = secp256k1_ecmult_multi_select(mem_limit, n_points);
    1011 | +    return secp256k1_ecmult_multi_internal(error_callback, algo, r, n_points, points, scalars, scalar_g);
    


    siv2r commented at 9:30 AM on March 26, 2026:

    The old secp256k1_ecmult_multi_var had an internal batch loop: it split n points into ceil(n / max_points_per_batch) chunks based on scratch capacity, ran Strauss or Pippenger on each chunk, and accumulated results. The new secp256k1_ecmult_multi passes n_points directly as batch_size to secp256k1_ecmult_multi_select with no splitting. All points are processed in one shot.

    This means when n_points * A + B > mem_limit for every algorithm, select falls back to TRIVIAL. On a 64-bit system, Strauss needs ~1,414 bytes/point and even the cheapest Pippenger (window 1) needs ~760 bytes/point. So for test_ecmult_multi_batching with n_points = 200:

    | mem_limit | Strauss (needs 283 KB for 200 pts) | Pippenger w1 (needs 153 KB for 200 pts) | Selected |
    |---|---|---|---|
    | 1 KB | exceeds | exceeds | TRIVIAL |
    | 4 KB | exceeds | exceeds | TRIVIAL |
    | 16 KB | exceeds | exceeds | TRIVIAL |
    | 64 KB | exceeds | exceeds | TRIVIAL |
    | 256 KB | exceeds | fits | Pippenger |
    | 1 MB+ | fits | fits | best fit |

    The old test (176 points, scratch sized from 1 to 176) used TRIVIAL in only 2 of 178 calls (the two explicit edge cases); every loop iteration used Strauss or Pippenger via batching. The new test hits TRIVIAL for 4 of 7 memory limits; it still passes (TRIVIAL is correct), but it no longer exercises the efficient algorithms under constrained memory.

    One fix: re-introduce the batch loop in secp256k1_ecmult_multi using the existing secp256k1_ecmult_multi_batch_size to compute chunk size, then call secp256k1_ecmult_multi_internal per chunk. This also means modules that want to manage their own batching (e.g., batch verification) could call _internal directly.

    Alternatively, if callers are always expected to provide sufficient mem_limit, the current design works. Since ecmult_multi isn't a public API, that's not unreasonable. But most callers probably won't know what "sufficient memory" means for a given n_points, they'd expect ecmult_multi to handle that for them (see also this discussion about exposing mem_limit to musig callers). I'm slightly leaning toward letting ecmult_multi do the batching like the old code, but will probably know more once caller requirements become clearer.


    siv2r commented at 12:12 PM on March 26, 2026:

    One downside of having ecmult_multi do internal batching (like the old code) is that callers lose visibility into which algorithm is actually being used. In particular, it's hard to detect a silent fallback to TRIVIAL. The current design avoids this since callers can call _select beforehand to inspect the choice.

    That said, I think this isn't a problem. If ecmult_multi handled batching internally, callers who just want the result would use it directly. Callers who want more control (inspect the algorithm, force a specific one, handle batching themselves) would use _select + _multi_internal instead.

    Right now, ecmult_multi is basically a thin wrapper over _select + _multi_internal, so adding batching to it would give the two functions more distinct roles. What do you think?


    fjahr commented at 9:32 PM on April 3, 2026:

    You are making a good argument and I agree; I oversimplified things by not keeping memory-constrained environments in mind here. I have introduced the batch loop. If possible it still tries to one-shot, but if not, it basically matches the behavior of secp256k1_ecmult_multi_var in master.

  28. in src/ecmult_impl.h:572 in 5050b5e097
     611 | -    }
     612 | +/* Strauss per-point memory */
     613 | +#define SECP256K1_STRAUSS_POINT_SIZE \
     614 | +    ((sizeof(secp256k1_ge) + sizeof(secp256k1_fe)) * ECMULT_TABLE_SIZE(WINDOW_A) \
     615 | +     + sizeof(struct secp256k1_strauss_point_state) \
     616 | +     + sizeof(secp256k1_gej) + sizeof(secp256k1_scalar))
    


    siv2r commented at 10:32 AM on March 26, 2026:
    #define SECP256K1_STRAUSS_POINT_SIZE \
        ((sizeof(secp256k1_ge) + sizeof(secp256k1_fe)) * ECMULT_TABLE_SIZE(WINDOW_A) \
         + sizeof(struct secp256k1_strauss_point_state) \
         + sizeof(secp256k1_gej))
    

    We should remove sizeof(secp256k1_scalar), as ecmult_multi_strauss now passes the scalars pointer directly to ecmult_multi_strauss_wnaf without allocating it on the heap (the old code used to do this)


    fjahr commented at 9:32 PM on April 3, 2026:

    Right, fixed.

  29. in src/bench_ecmult.c:154 in 5050b5e097
     149 | +        if (n >= 300) iters = base_iters / 2;
     150 | +        if (iters < 100) iters = 100;
     151 | +
     152 | +        t_start = gettime_i64();
     153 | +        for (iter = 0; iter < iters; iter++) {
     154 | +            secp256k1_ecmult_multi_internal(&data->ctx->error_callback, algo,
    


    siv2r commented at 11:05 AM on March 26, 2026:

    Should we wrap this in CHECK()?


    fjahr commented at 9:32 PM on April 3, 2026:

    Makes sense, done.

  30. in src/bench_ecmult.c:189 in 5050b5e097
     184 | +            if (n >= 15000) iters = base_iters / 100;
     185 | +            if (iters < 3) iters = 3;
     186 | +
     187 | +            t_start = gettime_i64();
     188 | +            for (iter = 0; iter < iters; iter++) {
     189 | +                secp256k1_ecmult_multi_internal(&data->ctx->error_callback, algo,
    


    siv2r commented at 11:06 AM on March 26, 2026:

    Should we wrap this in CHECK()?


    fjahr commented at 9:32 PM on April 3, 2026:

    Makes sense, done.

  31. in src/ecmult.h:96 in 5050b5e097 outdated
     102 | + * Among algorithms that fit within mem_limit for the given batch_size,
     103 | + * selects the one that minimizes time per operation C + D/batch_size.
     104 | + *
     105 | + * Returns: The optimal algorithm identifier.
     106 | + */
     107 | +static secp256k1_ecmult_multi_algo secp256k1_ecmult_multi_select(
    


    siv2r commented at 12:16 PM on March 26, 2026:

    nit: I think it'd be nice to document that _batch_size and _select may disagree on their choice of optimal algorithm.


    fjahr commented at 9:32 PM on April 3, 2026:

    Ok, added a comment

  32. in src/ecmult_impl.h:906 in 5050b5e097 outdated
    1064 | +                                                    points, scalars, scalar_g, 11);
    1065 | +        case SECP256K1_ECMULT_MULTI_ALGO_PIPPENGER_12:
    1066 | +            return secp256k1_ecmult_multi_pippenger(error_callback, r, n_points,
    1067 | +                                                    points, scalars, scalar_g, 12);
    1068 | +        default:
    1069 | +            return secp256k1_ecmult_multi_trivial(r, n_points, points, scalars, scalar_g);
    


    siv2r commented at 12:34 PM on March 26, 2026:
            default:
                VERIFY_CHECK(0); /* unreachable with valid algo values */
                return secp256k1_ecmult_multi_trivial(r, n_points, points, scalars, scalar_g);
    

    Since the enum covers exactly 14 values (0-13) and _select only returns valid values, this branch should be unreachable. If it ever runs, it likely means a bug, and silently falling back to TRIVIAL would make that very hard to notice. This would catch bugs in debug builds while keeping the fallback in the default build.


    fjahr commented at 9:32 PM on April 3, 2026:

    Makes sense, added!

  33. in src/tests.c:1 in 5050b5e097 outdated


    siv2r commented at 2:15 PM on March 26, 2026:

    The PR description acknowledges that test coverage is currently lower than before. Here's a list that might be helpful when adding tests back:

    • Test each algorithm individually via _multi_internal (TRIVIAL, STRAUSS, PIPPENGER 1-12), verifying results against secp256k1_ecmult for small inputs (n=0, 1, 2)
    • Structured edge case tests: all-infinity points, all-zero scalars, cancelling pairs (negated scalar with same point, same scalar with negated point), sum-to-zero across multiple points
    • Boundary tests for _batch_size: mem_limit=0 should return ECMULT_MAX_POINTS_PER_BATCH (TRIVIAL fallback), mem_limit=1 similarly. For larger values, verify the returned batch_size leads to a successful _multi call
    • Property tests for _select: with large mem_limit, _select(mem_limit, 10) should pick STRAUSS (fastest for tiny batches), _select(mem_limit, 100000) should pick a PIPPENGER variant (fastest for large batches), _select(0, n) should return TRIVIAL for any n.
      • I'm wondering if we can also define a range of points where each algorithm transitions to the next, and assert those transitions

    fjahr commented at 9:35 PM on April 3, 2026:

    Thanks, yeah, I had planned to get back to this and add some more tests. I think I was able to cover all your suggestions, but I skipped exploring coverage for the exact transitions since they seemed brittle and we might still recalibrate. I'm open to revisiting this once a few more eyes have been on the current calibration values, if other reviewers think it would be valuable to add.

  34. siv2r commented at 6:41 PM on March 26, 2026: contributor

    I re-ran my old analysis on the new ABCD values. The results are very positive, and there's clear improvement in the STRAUSS values: the linear fit is better now (0.983 -> 0.998).

    <details> <summary>ABCD Values</summary>

    static const struct secp256k1_ecmult_multi_abcd secp256k1_ecmult_multi_abcds[SECP256K1_ECMULT_MULTI_NUM_ALGOS] = {
        {0,                                   0,                                   1000,  0     },
        {SECP256K1_STRAUSS_POINT_SIZE,        0,                                   103,   274   },
        {SECP256K1_PIPPENGER_POINT_SIZE(1),   SECP256K1_PIPPENGER_FIXED_SIZE(1),   195,   417   },
        {SECP256K1_PIPPENGER_POINT_SIZE(2),   SECP256K1_PIPPENGER_FIXED_SIZE(2),   147,   602   },
        {SECP256K1_PIPPENGER_POINT_SIZE(3),   SECP256K1_PIPPENGER_FIXED_SIZE(3),   117,   888   },
        {SECP256K1_PIPPENGER_POINT_SIZE(4),   SECP256K1_PIPPENGER_FIXED_SIZE(4),   100,   1390  },
        {SECP256K1_PIPPENGER_POINT_SIZE(5),   SECP256K1_PIPPENGER_FIXED_SIZE(5),   86,    2227  },
        {SECP256K1_PIPPENGER_POINT_SIZE(6),   SECP256K1_PIPPENGER_FIXED_SIZE(6),   76,    3755  },
        {SECP256K1_PIPPENGER_POINT_SIZE(7),   SECP256K1_PIPPENGER_FIXED_SIZE(7),   67,    6626  },
        {SECP256K1_PIPPENGER_POINT_SIZE(8),   SECP256K1_PIPPENGER_FIXED_SIZE(8),   62,    10976 },
        {SECP256K1_PIPPENGER_POINT_SIZE(9),   SECP256K1_PIPPENGER_FIXED_SIZE(9),   57,    19909 },
        {SECP256K1_PIPPENGER_POINT_SIZE(10),  SECP256K1_PIPPENGER_FIXED_SIZE(10),  54,    36574 },
        {SECP256K1_PIPPENGER_POINT_SIZE(11),  SECP256K1_PIPPENGER_FIXED_SIZE(11),  50,    72630 },
        {SECP256K1_PIPPENGER_POINT_SIZE(12),  SECP256K1_PIPPENGER_FIXED_SIZE(12),  46,    135404},
    };
    

    </details>

    <details> <summary>Statistical analysis</summary>

    Metric Definitions

    | Metric | Meaning | Good Value for Benchmarking |
    |---|---|---|
    | $R^2$ | How well the linear model fits (0-1) | > 0.95 (strong linear relationship) |
    | Std Error | Uncertainty in slope estimate | Low relative to slope |
    | p-value | Probability slope = 0 by chance | < 0.05 (statistically significant) |
    | Slope | Time increase per additional n (μs) | Positive, algorithm-dependent |
    | Intercept | Fixed overhead time (μs) | Algorithm-dependent |

    Per-Algorithm Metrics

    | Algorithm | $R^2$ | Std Error | p-value | Slope | Intercept |
    |---|---|---|---|---|---|
    | STRAUSS | 0.998256 | 0.0468 | 2.77e-26 | 4.5477 | -15.70 |
    | PIPPENGER_1 | 0.999981 | 0.0110 | 6.54e-25 | 7.6313 | 16.23 |
    | PIPPENGER_2 | 0.999932 | 0.0131 | 1.38e-30 | 5.7699 | 21.74 |
    | PIPPENGER_3 | 0.999973 | 0.0066 | 1.98e-33 | 4.6724 | 27.37 |
    | PIPPENGER_4 | 0.999987 | 0.0040 | 3.30e-33 | 3.9572 | 45.39 |
    | PIPPENGER_5 | 0.999988 | 0.0033 | 9.93e-36 | 3.4059 | 80.97 |
    | PIPPENGER_6 | 0.999979 | 0.0044 | 1.02e-24 | 2.9971 | 123.83 |
    | PIPPENGER_7 | 0.999960 | 0.0057 | 4.08e-21 | 2.6388 | 220.59 |
    | PIPPENGER_8 | 0.999989 | 0.0027 | 1.18e-23 | 2.4487 | 353.68 |
    | PIPPENGER_9 | 0.999995 | 0.0019 | 8.61e-20 | 2.2403 | 700.71 |
    | PIPPENGER_10 | 0.999992 | 0.0028 | 7.05e-14 | 2.1095 | 1444.03 |
    | PIPPENGER_11 | 0.999914 | 0.0109 | 3.41e-07 | 1.9767 | 2804.01 |
    | PIPPENGER_12 | 0.999997 | 0.0033 | 1.14e-03 | 1.8235 | 5310.41 |

    </details>

  35. in src/ecmult_impl.h:598 in 5050b5e097 outdated
     644 | + * Index 2-13 = PIPPENGER with window sizes 1-12
     645 | + */
     646 | + static const struct secp256k1_ecmult_multi_abcd secp256k1_ecmult_multi_abcds[SECP256K1_ECMULT_MULTI_NUM_ALGOS] = {
     647 | +/*  A (per-point bytes)                  B (fixed bytes)                     C      D     */
     648 | +    {0,                                  0,                                  1000,  0     },
     649 | +    {SECP256K1_STRAUSS_POINT_SIZE,       0,                                  109,   120   },
    


    siv2r commented at 6:45 PM on March 26, 2026:

    I ran the benchmark a few times, and my Strauss D value is about twice this, in the 250-ish region. The CD values of the remaining algorithms match pretty well.


    siv2r commented at 5:58 PM on March 30, 2026:

    I take back this comment. The current CD values are good (see #1789 (comment)). Feel free to mark this as resolved.
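    For readers following the thread: the struct in the diff above encodes one (A, B, C, D) tuple per algorithm, where A*n + B approximates memory use in bytes and C + D/n the relative per-point time cost. A hedged sketch of the selection rule this implies (the C/D values are the ones quoted in this PR; the A/B byte sizes and the function name are illustrative placeholders, not the library's actual values or API):

    ```python
    ABCDS = [
        # (name,         A,    B,    C,    D)
        ("one_by_one",   0,    0,    1000, 0),     # plain ecmult, no extra memory
        ("strauss",      1000, 0,    109,  120),   # A is a placeholder byte size
        ("pippenger_5",  200,  8000, 86,   2187),  # A/B are placeholder byte sizes
    ]

    def select_algo(n, mem_budget_bytes):
        """Return the cheapest algorithm (by C + D/n) that fits the memory budget."""
        best = None
        for name, a, b, c, d in ABCDS:
            if a * n + b > mem_budget_bytes:
                continue  # this algorithm's tables would not fit
            cost = c + d / n  # per-point cost; the fixed cost D amortizes over n
            if best is None or cost < best[0]:
                best = (cost, name)
        return best[1]
    ```

    With a generous budget this picks strauss for small batches and pippenger for large ones, and falls back to one-by-one multiplication when nothing else fits.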

  36. siv2r commented at 5:57 PM on March 30, 2026: contributor

    I was curious whether we could squeeze out better CD values by calibrating each algorithm on its most optimal batch range, instead of the broader ranges we currently use. So I ran all 14 ecmult_multi algorithms across [2, 50000] and plotted the per-point times to figure out which algorithm actually wins at each batch size (see the two toggles below for the full picture).

    <details> <summary>graph comparing all 14 algorithms for speedup</summary> <img width="1200" height="800" alt="speedup_vs_individual" src="https://github.com/user-attachments/assets/229b6f11-a580-4831-ac57-f3016a2b758d" />

    </details>

    <details> <summary>optimal algorithm for each batch size</summary> <img width="1200" height="800" alt="winners_speedup" src="https://github.com/user-attachments/assets/3d8d38fa-de6b-4586-b178-4417f57628f5" />

    Optimal Algorithm Ranges

    | Optimal range | Best algorithm | Per-point time (μs) | Speedup vs individual |
    | --- | --- | --- | --- |
    | [2, 90] | STRAUSS | 4.40 - 6.97 | 1.40x - 2.27x |
    | [95, 130] | PIPPENGER_5 | 4.12 - 4.37 | 2.30x - 2.44x |
    | [150, 250] | PIPPENGER_6 | 3.67 - 4.01 | 2.51x - 2.76x |
    | [300, 750] | PIPPENGER_7 | 3.02 - 3.56 | 2.84x - 3.37x |
    | [1000, 1500] | PIPPENGER_8 | 2.74 - 2.88 | 3.59x - 3.77x |
    | [2000, 7500] | PIPPENGER_9 | 2.34 - 2.63 | 3.91x - 4.42x |
    | [10000, 12500] | PIPPENGER_11 | 2.24 - 2.29 | 4.53x - 4.62x |
    | [15000, 50000] | PIPPENGER_12 | 1.96 - 2.20 | 4.71x - 5.29x |

    Key Findings

    • Never optimal: PIPPENGER_1, PIPPENGER_2, PIPPENGER_3, PIPPENGER_4, PIPPENGER_10 — dominated by neighbors at every batch size.
    • Crossover points: STRAUSS -> PIPPENGER_5 at n~92; PIPPENGER_5 -> PIPPENGER_6 at n=140; PIPPENGER_6 -> PIPPENGER_7 at n=275; PIPPENGER_7 -> PIPPENGER_8 at n=875; PIPPENGER_8 -> PIPPENGER_9 at n=1750; PIPPENGER_9 -> PIPPENGER_11 at n=8750; PIPPENGER_11 -> PIPPENGER_12 at n=13750.

    Winner at each batch size

    | N | Best algorithm | Per-point time (μs) | Speedup |
    | --- | --- | --- | --- |
    | 2 | STRAUSS | 6.97 | 1.40x |
    | 3 | STRAUSS | 6.12 | 1.60x |
    | 4 | STRAUSS | 5.75 | 1.71x |
    | 5 | STRAUSS | 5.46 | 1.80x |
    | 7 | STRAUSS | 5.15 | 1.93x |
    | 10 | STRAUSS | 4.85 | 2.06x |
    | 13 | STRAUSS | 4.79 | 2.09x |
    | 16 | STRAUSS | 4.61 | 2.15x |
    | 20 | STRAUSS | 4.58 | 2.16x |
    | 25 | STRAUSS | 4.58 | 2.18x |
    | 30 | STRAUSS | 4.46 | 2.24x |
    | 35 | STRAUSS | 4.46 | 2.23x |
    | 40 | STRAUSS | 4.48 | 2.23x |
    | 45 | STRAUSS | 4.44 | 2.25x |
    | 50 | STRAUSS | 4.48 | 2.24x |
    | 55 | STRAUSS | 4.40 | 2.27x |
    | 60 | STRAUSS | 4.42 | 2.26x |
    | 65 | STRAUSS | 4.42 | 2.27x |
    | 70 | STRAUSS | 4.42 | 2.27x |
    | 75 | STRAUSS | 4.41 | 2.27x |
    | 80 | STRAUSS | 4.46 | 2.26x |
    | 85 | STRAUSS | 4.45 | 2.26x |
    | 88 | PIPPENGER_5 | 4.45 | 2.26x |
    | 90 | STRAUSS | 4.43 | 2.27x |
    | 95 | PIPPENGER_5 | 4.37 | 2.30x |
    | 100 | PIPPENGER_5 | 4.33 | 2.33x |
    | 110 | PIPPENGER_5 | 4.25 | 2.37x |
    | 120 | PIPPENGER_5 | 4.18 | 2.41x |
    | 130 | PIPPENGER_5 | 4.12 | 2.44x |
    | 150 | PIPPENGER_6 | 4.01 | 2.51x |
    | 175 | PIPPENGER_6 | 3.86 | 2.60x |
    | 200 | PIPPENGER_6 | 3.77 | 2.68x |
    | 250 | PIPPENGER_6 | 3.67 | 2.76x |
    | 300 | PIPPENGER_7 | 3.56 | 2.84x |
    | 350 | PIPPENGER_7 | 3.46 | 2.93x |
    | 400 | PIPPENGER_7 | 3.38 | 2.99x |
    | 500 | PIPPENGER_7 | 3.25 | 3.11x |
    | 600 | PIPPENGER_7 | 3.09 | 3.29x |
    | 750 | PIPPENGER_7 | 3.02 | 3.37x |
    | 1000 | PIPPENGER_8 | 2.88 | 3.59x |
    | 1200 | PIPPENGER_8 | 2.81 | 3.69x |
    | 1500 | PIPPENGER_8 | 2.74 | 3.77x |
    | 2000 | PIPPENGER_9 | 2.63 | 3.91x |
    | 2500 | PIPPENGER_9 | 2.56 | 4.05x |
    | 3000 | PIPPENGER_9 | 2.51 | 4.12x |
    | 4000 | PIPPENGER_9 | 2.46 | 4.20x |
    | 5000 | PIPPENGER_9 | 2.42 | 4.28x |
    | 7500 | PIPPENGER_10 | 2.34 | 4.42x |
    | 10000 | PIPPENGER_11 | 2.29 | 4.53x |
    | 12500 | PIPPENGER_11 | 2.24 | 4.62x |
    | 15000 | PIPPENGER_12 | 2.20 | 4.71x |
    | 20000 | PIPPENGER_12 | 2.12 | 4.90x |
    | 25000 | PIPPENGER_12 | 2.07 | 5.01x |
    | 30000 | PIPPENGER_12 | 2.03 | 5.09x |
    | 40000 | PIPPENGER_12 | 2.00 | 5.20x |
    | 50000 | PIPPENGER_12 | 1.96 | 5.29x |

    </details>

    A few things stood out. First, pippenger w=1 through w=4 are never optimal at any batch size — after strauss, ecmult_multi jumps straight to pippenger w=5. The pippenger CD values in general are already well calibrated, which makes sense since the linear fit for those is near perfect (R^2 > 0.999).

    The interesting part was strauss. Strauss flatlines after about n=90. So when we calibrate using batches from [2, 500] as we currently do, we're fitting a line to data that stops being linear after 90. Including these flatline points pulls the regression intercept C upward (104 -> 109) and the slope D downward (184 -> 120) compared to what we'd get from fitting just the linear region [2, 90]. In other words, the non-linear tail inflates the C value.

    Surprisingly, that inflation is actually helpful. The ecmult_multi_select picks the method that minimizes C + D/n. With the current values (C=109, D=120 for strauss, C=86, D=2187 for pippenger w=5), the crossover n works out to (2187 - 120) / (109 - 86) = 89.9. Which lines up almost perfectly with the empirical data showing pippenger w=5 becomes faster around n=92 (in the second graph above). Also, the old code switched to pippenger at n = 88.
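    That crossover arithmetic generalizes: two algorithms with per-point cost C + D/n break even where C1 + D1/n = C2 + D2/n, i.e. at n* = (D2 - D1) / (C1 - C2). A quick sanity check of the numbers quoted above (values from this PR; the helper name is just for illustration):

    ```python
    def crossover_n(c1, d1, c2, d2):
        """Batch size where cost C1 + D1/n equals C2 + D2/n."""
        return (d2 - d1) / (c1 - c2)

    # strauss (C=109, D=120) vs pippenger w=5 (C=86, D=2187):
    n_star = crossover_n(109, 120, 86, 2187)  # ~89.9, near the empirical ~92
    ```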

    When I tried to "fix" the linear fit by calibrating strauss only on [2, 90] where it's truly linear, I got a much better R^2 (0.9998 vs 0.983) but the crossover moved to n=110. That means ecmult_multi would keep picking strauss all the way through n=110, even though pippenger w=5 is clearly faster by n=95. Better statistical fit, worse algorithm selection.

    TL;DR the current CD values are near perfect. The slightly "bad" linear fit for strauss isn't a problem. It is exactly what pushes the crossover to pippenger at the correct batch size (at n=89).

  37. fjahr force-pushed on Apr 3, 2026
  38. ecmult: Refactor ecmult algo selection 5a4de114c8
  39. bench: Tooling for ecmult algo selection abcd calibration 7048d27bb7
  40. fjahr force-pushed on Apr 3, 2026
  41. fjahr commented at 9:50 PM on April 3, 2026: contributor

    Huge thanks for the helpful review comments, @siv2r, I think I have addressed them all now.

Contributors

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-18 17:15 UTC
