speed optimization #436

issue msotoodeh opened this issue on December 29, 2016
  1. msotoodeh commented at 7:38 pm on December 29, 2016: none

    I suggest the following optimization in the scalar_4x64_impl.h file:

    CHANGE:

    #define muladd_fast(a,b) { \
        uint64_t tl, th; \
        { \
            uint128_t t = (uint128_t)a * b; \
            th = t >> 64;         /* at most 0xFFFFFFFFFFFFFFFE */ \
            tl = t; \
        } \
        c0 += tl;                 /* overflow is handled on the next line */ \
        th += (c0 < tl) ? 1 : 0;  /* at most 0xFFFFFFFFFFFFFFFF */ \
        c1 += th;                 /* never overflows by contract (verified in the next line) */ \
        VERIFY_CHECK(c1 >= th); \
    }
    

    TO:

    #define muladd_fast(a,b) { \
        uint64_t tl, th; \
        uint128_t t = (uint128_t)a * b + c0; \
        c0 = (uint64_t)t; \
        c1 += (uint64_t)(t >> 64);  /* never overflows by contract (verified in the next line) */ \
        VERIFY_CHECK(c1 >= t >> 64); \
    }
    

    This is repeated in multiple macros.
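As a rough way to check that the two versions of muladd_fast above compute the same c0 and c1, here is a minimal sketch in C. It is not code from the library: `acc_t`, `muladd_fast_old`, `muladd_fast_new`, `muladd_variants_agree` and `muladd_selfcheck` are hypothetical names introduced for illustration, `uint128_t` is assumed to map to the GCC/Clang `unsigned __int128` extension, and the `VERIFY_CHECK` lines are left out so that only the arithmetic is compared.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: uint128_t maps to the GCC/Clang unsigned __int128 extension. */
typedef unsigned __int128 uint128_t;

/* Hypothetical struct standing in for the c0/c1 locals that the macros update. */
typedef struct {
    uint64_t c0, c1;
} acc_t;

/* Current formulation: split the product, then propagate the carry by hand. */
static void muladd_fast_old(acc_t *acc, uint64_t a, uint64_t b) {
    uint128_t t = (uint128_t)a * b;
    uint64_t th = (uint64_t)(t >> 64); /* at most 0xFFFFFFFFFFFFFFFE */
    uint64_t tl = (uint64_t)t;
    acc->c0 += tl;                     /* may wrap; the carry is handled next */
    th += (acc->c0 < tl) ? 1 : 0;      /* at most 0xFFFFFFFFFFFFFFFF */
    acc->c1 += th;
}

/* Proposed formulation: fold c0 into the 128-bit sum, then split once. */
static void muladd_fast_new(acc_t *acc, uint64_t a, uint64_t b) {
    uint128_t t = (uint128_t)a * b + acc->c0; /* at most 2^128 - 2^64 */
    acc->c0 = (uint64_t)t;
    acc->c1 += (uint64_t)(t >> 64);
}

/* Returns 1 when both formulations leave identical c0 and c1. */
static int muladd_variants_agree(uint64_t a, uint64_t b, uint64_t c0, uint64_t c1) {
    acc_t x = { c0, c1 };
    acc_t y = { c0, c1 };
    muladd_fast_old(&x, a, b);
    muladd_fast_new(&y, a, b);
    return x.c0 == y.c0 && x.c1 == y.c1;
}

/* Spot-checks a few corner cases, including the largest possible inputs. */
static void muladd_selfcheck(void) {
    assert(muladd_variants_agree(0, 0, 0, 0));
    assert(muladd_variants_agree(UINT64_MAX, UINT64_MAX, 0, 0));
    assert(muladd_variants_agree(UINT64_MAX, UINT64_MAX, UINT64_MAX, 0));
    assert(muladd_variants_agree(1, UINT64_MAX, 1, 5));
}
```

Calling `muladd_selfcheck()` from a test program's `main` exercises the corner cases. This only shows that the results match on the sampled inputs; it says nothing about which variant is faster.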

  2. msotoodeh commented at 7:46 pm on December 29, 2016: none

    Better to remove the declarations of tl and th, which are now unused, to avoid compiler warnings.

    #define muladd_fast(a,b) { \
        uint128_t t = (uint128_t)a * b + c0; \
        c0 = (uint64_t)t; \
        c1 += (uint64_t)(t >> 64);  /* never overflows by contract (verified in the next line) */ \
        VERIFY_CHECK(c1 >= t >> 64); \
    }
    
  3. bitcoin-core deleted a comment on Feb 20, 2019
  4. sipa commented at 4:00 am on February 21, 2019: contributor
    Sorry for the very slow response, but this looks correct. We don’t need the more complicated overflow handling when c2 is not affected.
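One way to see why the single 128-bit addition in the proposed macro is safe is the following bound, a sketch that assumes a, b and c0 are arbitrary 64-bit values, i.e. each is at most 2^64 - 1:

```latex
\[
a \cdot b + c_0 \;\le\; (2^{64}-1)^2 + (2^{64}-1)
            \;=\; (2^{64}-1)\cdot 2^{64}
            \;=\; 2^{128} - 2^{64}
            \;<\; 2^{128}
\]
```

So the sum always fits in 128 bits, and its high word is at most 2^64 - 1 = 0xFFFFFFFFFFFFFFFF, the same bound the original comment gives for th after the carry. The addition into c1 is still covered only by the caller's contract, which is what the VERIFY_CHECK asserts.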
  5. gmaxwell commented at 9:49 pm on February 25, 2019: contributor
    Does someone care to make a PR implementing this change? @msotoodeh maybe?
  6. real-or-random commented at 11:04 am on February 26, 2019: contributor
    I pasted that into Godbolt a few days ago, and it actually seems slower than the current version with -O2 and -O3 (minimally slower on gcc and somewhat slower on clang): https://godbolt.org/z/dndn44. To know for sure how it behaves in the context of the rest of the code, we need a benchmark though.
  7. real-or-random cross-referenced this on Aug 7, 2020 from issue Use double-wide types for additions in scalar code by sipa

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-01-24 06:15 UTC
