feefrac: 128-bit multiply support in MSVC #29758

pull sipa wants to merge 2 commits into bitcoin:master from sipa:202403_feefrac128_msvc changing 2 files +81 −2
  1. sipa commented at 3:33 pm on March 28, 2024: member

    Feerate comparisons in the recently (#29242) introduced FeeFrac type rely on multiplications between 32-bit and 64-bit integers. On 64-bit systems, hardware can do this natively. On GCC and Clang we can use the __int128 type for this, but on MSVC one needs to use the _mul128 or _mulh intrinsics instead. This PR adds the use of _mul128 which is available on x86_64 systems.

    Performance of these operations isn’t currently very important, but they will become crucial with cluster mempool.

    I have not tested this code myself, though it’s based on similar code in libsecp256k1 (see https://github.com/bitcoin-core/secp256k1/blob/v0.4.1/src/int128_struct_impl.h#L7L30).

  2. DrahtBot commented at 3:33 pm on March 28, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK theuni
    Concept ACK hebasto

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  3. in src/util/feefrac.h:14 in 6988dd6083 outdated
    10@@ -11,6 +11,10 @@
    11 #include <span.h>
    12 #include <util/check.h>
    13 
    14+#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64))
    


    hebasto commented at 3:47 pm on March 28, 2024:
    We do not explicitly support ARM64 for Windows. But it won’t hurt, and is good for completeness.

    sipa commented at 3:49 pm on March 28, 2024:
    Oh if that’s not tested anywhere I should probably remove it. There would be no way to catch mistakes in it.

    hebasto commented at 3:51 pm on March 28, 2024:
    Yes, we have no tests for MSVC builds on ARM64.

    sipa commented at 5:33 pm on March 28, 2024:
    I’ve dropped the ARM64 code, and added some more unit tests.
  4. hebasto commented at 3:47 pm on March 28, 2024: member
    Concept ACK.
  5. sipa commented at 3:58 pm on March 28, 2024: member
    Hmm, does our CI not run the fuzz corpus when building with MSVC?
  6. dergoegge commented at 4:04 pm on March 28, 2024: member

    Hmm, does our CI not run the fuzz corpus when building with MSVC?

    Afaik it does not but it would probably make sense. We also do it in the MacOS job.

  7. hebasto commented at 4:04 pm on March 28, 2024: member

    Hmm, does our CI not run the fuzz corpus when building with MSVC?

    No, it does not. Even a fuzz binary is not built.

  8. sipa commented at 4:12 pm on March 28, 2024: member
    I think it would be very useful if we did. Fuzzing itself won’t work in MSVC, but I don’t think there is a technical reason why we can’t run the existing fuzz corpus.
  9. sipa force-pushed on Mar 28, 2024
  10. sipa force-pushed on Mar 28, 2024
  11. feefrac: 128-bit multiply support in MSVC 119f0a59d1
  12. feefrac: add some more unit tests 5fb70b5b72
  13. sipa force-pushed on Mar 30, 2024
  14. DrahtBot added the label CI failed on Apr 18, 2024
  15. DrahtBot removed the label CI failed on Apr 23, 2024
  16. fanquake referenced this in commit 3aaf7328eb on Apr 28, 2024
  17. in src/util/feefrac.h:68 in 119f0a59d1 outdated
    64+#elif defined(_MSC_VER) && defined(_M_X64)
    65+    static inline std::pair<int64_t, uint64_t> Mul(int64_t a, int32_t b) noexcept
    66+    {
    67+        // On 64-bit MSVC, use _mul128 intrinsic for wide multiplication.
    68+        std::pair<int64_t, uint64_t> ret;
    69+        ret.second = _mul128(a, b, &ret.first);
    


    theuni commented at 4:09 pm on May 28, 2024:

    To help other reviewers:

    0__int64 _mul128(
    1   __int64 Multiplier,
    2   __int64 Multiplicand,
    3   __int64 *HighProduct
    4);
    5
    6Return value
    7The low 64 bits of the product.
    

    So this should result in: ret = <high, low>

    Which matches the other impls.


    sipa commented at 4:47 pm on May 28, 2024:
    Note the weirdly signed return value __int64 here, which IMHO makes no sense whatsoever. Semantically, the return value is 264HighProduct + (uint64_t)ret.
  18. theuni approved
  19. theuni commented at 4:28 pm on May 28, 2024: member

    utACK 5fb70b5b72c96a7e640900de56393b0bdb9df37a

    (I didn’t look into the additional unit tests)

    Godbolt output here: https://godbolt.org/z/8Msz984bv

    Compiles down to roughly the same.

  20. DrahtBot requested review from hebasto on May 28, 2024
  21. in src/util/feefrac.h:64 in 5fb70b5b72
    56@@ -53,10 +57,19 @@ struct FeeFrac
    57 #ifdef __SIZEOF_INT128__
    58     static inline __int128 Mul(int64_t a, int32_t b) noexcept
    59     {
    60-        // If __int128 is available, use 128-bit wide multiply.
    61+        // If __int128 is available, use 128-bit wide multiplication.
    62         return __int128{a} * b;
    63     }
    64+#elif defined(_MSC_VER) && defined(_M_X64)
    65+    static inline std::pair<int64_t, uint64_t> Mul(int64_t a, int32_t b) noexcept
    


    hebasto commented at 10:01 am on June 3, 2024:
    Why do this Mul’s return type and MulFallback’s one differ?

    sipa commented at 10:54 am on June 3, 2024:
    That’s the point; they represent the 96-bit integer product in a different way.

    sipa commented at 11:43 am on June 3, 2024:

    It would be possible to use std::pair<int32_t, uint64_t> here instead (as in, that type would be big enough to store the result), but _mul128 returns 64-bit results, which match the register size of x86_64, so no conversion is needed.

    Also note that std::pair<int64_t, uint64_t> probably roughly corresponds to how __int128 is represented internally (the CPU has no 128-bit general-purpose registers, and the mul instruction returns two 64-bit registers).

  22. DrahtBot requested review from hebasto on Jun 3, 2024
  23. hebasto commented at 9:09 am on June 4, 2024: member

    I’ve add a benchmark for the FeeRateCompare function, which calls Mul twice.

    Here are results on Windows 11, Release configuration, which implies /O2 /Oi compile flags:

    • master branch @ 61de64df6790077857faba84796bb874b59c5d15:
     0> .\build_msvc\master\bench_bitcoin.exe -filter=FeefracMultipication
     1
     2|               ns/op |                op/s |    err% |     total | benchmark
     3|--------------------:|--------------------:|--------:|----------:|:----------
     4|                1.47 |      680,300,487.93 |    1.9% |      0.17 | `FeefracMultipication`
     5> .\build_msvc\master\bench_bitcoin.exe -filter=FeefracMultipication
     6
     7|               ns/op |                op/s |    err% |     total | benchmark
     8|--------------------:|--------------------:|--------:|----------:|:----------
     9|                1.47 |      679,903,607.75 |    1.1% |      0.18 | `FeefracMultipication`
    10> .\build_msvc\master\bench_bitcoin.exe -filter=FeefracMultipication
    11
    12|               ns/op |                op/s |    err% |     total | benchmark
    13|--------------------:|--------------------:|--------:|----------:|:----------
    14|                1.43 |      698,557,147.13 |    1.3% |      0.17 | `FeefracMultipication`
    
    • this PR:
     0> .\build_msvc\pr29758\bench_bitcoin.exe -filter=FeefracMultipication
     1
     2|               ns/op |                op/s |    err% |     total | benchmark
     3|--------------------:|--------------------:|--------:|----------:|:----------
     4|                1.53 |      654,804,351.62 |    0.7% |      0.18 | `FeefracMultipication`
     5> .\build_msvc\pr29758\bench_bitcoin.exe -filter=FeefracMultipication
     6
     7|               ns/op |                op/s |    err% |     total | benchmark
     8|--------------------:|--------------------:|--------:|----------:|:----------
     9|                1.52 |      659,374,380.82 |    0.7% |      0.18 | `FeefracMultipication`
    10> .\build_msvc\pr29758\bench_bitcoin.exe -filter=FeefracMultipication
    11
    12|               ns/op |                op/s |    err% |     total | benchmark
    13|--------------------:|--------------------:|--------:|----------:|:----------
    14|                1.51 |      660,797,183.37 |    0.6% |      0.18 | `FeefracMultipication`
    

    The performance worsened by ~4%.

    cc @sipsorcery


    FWIW, this benchmark shows <10% of performance improvement for __int128 implementation comparing to the naive one on Linux.

  24. in src/util/feefrac.h:69 in 5fb70b5b72
    65+    static inline std::pair<int64_t, uint64_t> Mul(int64_t a, int32_t b) noexcept
    66+    {
    67+        // On 64-bit MSVC, use _mul128 intrinsic for wide multiplication.
    68+        std::pair<int64_t, uint64_t> ret;
    69+        ret.second = _mul128(a, b, &ret.first);
    70+        return ret;
    


    hebasto commented at 10:45 am on June 4, 2024:

    on MSVC one needs to use the _mul128 or _mulh intrinsics instead

    From my research, it follows that __mulh seems more performant.

    This code:

    0        // On 64-bit MSVC, use __mulh intrinsic for wide multiplication.
    1        return {__mulh(a, b), std::bit_cast<uint64_t>(a) * b};
    

    gives the following numbers for the benchmark:

     0> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
     1
     2|               ns/op |                op/s |    err% |     total | benchmark
     3|--------------------:|--------------------:|--------:|----------:|:----------
     4|                1.41 |      707,119,862.93 |    2.6% |      0.17 | `FeefracMultipication`
     5> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
     6
     7|               ns/op |                op/s |    err% |     total | benchmark
     8|--------------------:|--------------------:|--------:|----------:|:----------
     9|                1.36 |      734,209,621.15 |    0.9% |      0.16 | `FeefracMultipication`
    10> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
    11
    12|               ns/op |                op/s |    err% |     total | benchmark
    13|--------------------:|--------------------:|--------:|----------:|:----------
    14|                1.39 |      720,775,228.66 |    3.1% |      0.17 | `FeefracMultipication`
    
  25. DrahtBot requested review from hebasto on Jun 4, 2024
  26. sipa commented at 5:36 pm on June 4, 2024: member
    @hebasto Would you mind benchmarking with smaller realistic numbers (say fee and size both between 0 and 2^30)? If the top limb is equal, it’s possible the naive code does worse.
  27. hebasto commented at 12:19 pm on June 6, 2024: member

    @sipa

    Would you mind benchmarking with smaller realistic numbers (say fee and size both between 0 and 2^30)?

    I switched to FastRandomContext::randbits(30).

    If the top limb is equal, it’s possible the naive code does worse.

    Numbers have not changed:

    • the master branch:
     0> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
     1
     2|               ns/op |                op/s |    err% |     total | benchmark
     3|--------------------:|--------------------:|--------:|----------:|:----------
     4|                1.47 |      682,496,667.82 |    1.1% |      0.17 | `FeefracMultipication`
     5> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
     6
     7|               ns/op |                op/s |    err% |     total | benchmark
     8|--------------------:|--------------------:|--------:|----------:|:----------
     9|                1.46 |      686,538,677.70 |    1.3% |      0.17 | `FeefracMultipication`
    10> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
    11
    12|               ns/op |                op/s |    err% |     total | benchmark
    13|--------------------:|--------------------:|--------:|----------:|:----------
    14|                1.43 |      701,448,254.82 |    1.5% |      0.17 | `FeefracMultipication`
    
    • this PR:
     0> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
     1
     2|               ns/op |                op/s |    err% |     total | benchmark
     3|--------------------:|--------------------:|--------:|----------:|:----------
     4|                1.55 |      646,069,682.69 |    0.8% |      0.18 | `FeefracMultipication`
     5> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
     6
     7|               ns/op |                op/s |    err% |     total | benchmark
     8|--------------------:|--------------------:|--------:|----------:|:----------
     9|                1.52 |      656,962,444.61 |    0.6% |      0.18 | `FeefracMultipication`
    10> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
    11
    12|               ns/op |                op/s |    err% |     total | benchmark
    13|--------------------:|--------------------:|--------:|----------:|:----------
    14|                1.53 |      653,076,477.59 |    0.9% |      0.18 | `FeefracMultipication`
    
  28. sipa commented at 1:07 am on June 12, 2024: member
    @hebasto Thanks. I was assuming this would be an obvious improvement, but if it isn’t, it’ll need some more investigation into what this is all compiled to. That’s not something I’m interested in doing for a half-supported architecture.
  29. sipa closed this on Jun 12, 2024


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-24 21:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me