feefrac: 128-bit multiply support in MSVC

sipa commented at 3:33 pm on March 28, 2024: member

Feerate comparisons in the recently (#29242) introduced FeeFrac type rely on multiplications between 32-bit and 64-bit integers. On 64-bit systems, hardware can do this natively. On GCC and Clang we can use the __int128 type for this, but on MSVC one needs to use the _mul128 or _mulh intrinsics instead. This PR adds the use of _mul128 which is available on x86_64 systems.

Performance of these operations isn’t currently very important, but they will become crucial with cluster mempool.

I have not tested this code myself, though it’s based on similar code in libsecp256k1 (see https://github.com/bitcoin-core/secp256k1/blob/v0.4.1/src/int128_struct_impl.h#L7L30).

DrahtBot commented at 3:33 pm on March 28, 2024: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	theuni
Concept ACK	hebasto

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

in src/util/feefrac.h:14 in 6988dd6083 outdated

10@@ -11,6 +11,10 @@
11 #include <span.h>
12 #include <util/check.h>
13 
14+#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64))

hebasto commented at 3:47 pm on March 28, 2024:

We do not explicitly support ARM64 for Windows. But it won’t hurt, and is good for completeness.

sipa commented at 3:49 pm on March 28, 2024:

Oh if that’s not tested anywhere I should probably remove it. There would be no way to catch mistakes in it.

hebasto commented at 3:51 pm on March 28, 2024:

Yes, we have no tests for MSVC builds on ARM64.

sipa commented at 5:33 pm on March 28, 2024:

I’ve dropped the ARM64 code, and added some more unit tests.

hebasto commented at 3:47 pm on March 28, 2024: member

Concept ACK.

sipa commented at 3:58 pm on March 28, 2024: member

Hmm, does our CI not run the fuzz corpus when building with MSVC?

dergoegge commented at 4:04 pm on March 28, 2024: member

Hmm, does our CI not run the fuzz corpus when building with MSVC?

Afaik it does not but it would probably make sense. We also do it in the MacOS job.

hebasto commented at 4:04 pm on March 28, 2024: member

Hmm, does our CI not run the fuzz corpus when building with MSVC?

No, it does not. Even a fuzz binary is not built.

sipa commented at 4:12 pm on March 28, 2024: member

I think it would be very useful if we did. Fuzzing itself won’t work in MSVC, but I don’t think there is a technical reason why we can’t run the existing fuzz corpus.

sipa force-pushed on Mar 28, 2024

feefrac: 128-bit multiply support in MSVC 119f0a59d1

feefrac: add some more unit tests 5fb70b5b72

sipa force-pushed on Mar 30, 2024

DrahtBot added the label CI failed on Apr 18, 2024

DrahtBot removed the label CI failed on Apr 23, 2024

fanquake referenced this in commit 3aaf7328eb on Apr 28, 2024

in src/util/feefrac.h:68 in 119f0a59d1 outdated

64+#elif defined(_MSC_VER) && defined(_M_X64)
65+    static inline std::pair<int64_t, uint64_t> Mul(int64_t a, int32_t b) noexcept
66+    {
67+        // On 64-bit MSVC, use _mul128 intrinsic for wide multiplication.
68+        std::pair<int64_t, uint64_t> ret;
69+        ret.second = _mul128(a, b, &ret.first);

theuni commented at 4:09 pm on May 28, 2024:

To help other reviewers:

0__int64 _mul128(
1   __int64 Multiplier,
2   __int64 Multiplicand,
3   __int64 *HighProduct
4);
5
6Return value
7The low 64 bits of the product.

So this should result in: ret = <high, low>

Which matches the other impls.

sipa commented at 4:47 pm on May 28, 2024:

Note the weirdly signed return value __int64 here, which IMHO makes no sense whatsoever. Semantically, the return value is 264HighProduct + (uint64_t)ret.

theuni approved

theuni commented at 4:28 pm on May 28, 2024: member

utACK 5fb70b5b72c96a7e640900de56393b0bdb9df37a

(I didn’t look into the additional unit tests)

Godbolt output here: https://godbolt.org/z/8Msz984bv

Compiles down to roughly the same.

DrahtBot requested review from hebasto on May 28, 2024

in src/util/feefrac.h:64 in 5fb70b5b72

56@@ -53,10 +57,19 @@ struct FeeFrac
57 #ifdef __SIZEOF_INT128__
58     static inline __int128 Mul(int64_t a, int32_t b) noexcept
59     {
60-        // If __int128 is available, use 128-bit wide multiply.
61+        // If __int128 is available, use 128-bit wide multiplication.
62         return __int128{a} * b;
63     }
64+#elif defined(_MSC_VER) && defined(_M_X64)
65+    static inline std::pair<int64_t, uint64_t> Mul(int64_t a, int32_t b) noexcept

hebasto commented at 10:01 am on June 3, 2024:

Why do this Mul’s return type and MulFallback’s one differ?

sipa commented at 10:54 am on June 3, 2024:

That’s the point; they represent the 96-bit integer product in a different way.

sipa commented at 11:43 am on June 3, 2024:

It would be possible to use std::pair<int32_t, uint64_t> here instead (as in, that type would be big enough to store the result), but _mul128 returns 64-bit results, which match the register size of x86_64, so no conversion is needed.

Also note that std::pair<int64_t, uint64_t> probably roughly corresponds to how __int128 is represented internally (the CPU has no 128-bit general-purpose registers, and the mul instruction returns two 64-bit registers).

DrahtBot requested review from hebasto on Jun 3, 2024

hebasto commented at 9:09 am on June 4, 2024: member

I’ve add a benchmark for the FeeRateCompare function, which calls Mul twice.

Here are results on Windows 11, Release configuration, which implies /O2 /Oi compile flags:

master branch @ 61de64df6790077857faba84796bb874b59c5d15:

 0> .\build_msvc\master\bench_bitcoin.exe -filter=FeefracMultipication
 1
 2|               ns/op |                op/s |    err% |     total | benchmark
 3|--------------------:|--------------------:|--------:|----------:|:----------
 4|                1.47 |      680,300,487.93 |    1.9% |      0.17 | `FeefracMultipication`
 5> .\build_msvc\master\bench_bitcoin.exe -filter=FeefracMultipication
 6
 7|               ns/op |                op/s |    err% |     total | benchmark
 8|--------------------:|--------------------:|--------:|----------:|:----------
 9|                1.47 |      679,903,607.75 |    1.1% |      0.18 | `FeefracMultipication`
10> .\build_msvc\master\bench_bitcoin.exe -filter=FeefracMultipication
11
12|               ns/op |                op/s |    err% |     total | benchmark
13|--------------------:|--------------------:|--------:|----------:|:----------
14|                1.43 |      698,557,147.13 |    1.3% |      0.17 | `FeefracMultipication`

this PR:

 0> .\build_msvc\pr29758\bench_bitcoin.exe -filter=FeefracMultipication
 1
 2|               ns/op |                op/s |    err% |     total | benchmark
 3|--------------------:|--------------------:|--------:|----------:|:----------
 4|                1.53 |      654,804,351.62 |    0.7% |      0.18 | `FeefracMultipication`
 5> .\build_msvc\pr29758\bench_bitcoin.exe -filter=FeefracMultipication
 6
 7|               ns/op |                op/s |    err% |     total | benchmark
 8|--------------------:|--------------------:|--------:|----------:|:----------
 9|                1.52 |      659,374,380.82 |    0.7% |      0.18 | `FeefracMultipication`
10> .\build_msvc\pr29758\bench_bitcoin.exe -filter=FeefracMultipication
11
12|               ns/op |                op/s |    err% |     total | benchmark
13|--------------------:|--------------------:|--------:|----------:|:----------
14|                1.51 |      660,797,183.37 |    0.6% |      0.18 | `FeefracMultipication`

The performance worsened by ~4%.

cc @sipsorcery

FWIW, this benchmark shows <10% of performance improvement for __int128 implementation comparing to the naive one on Linux.

in src/util/feefrac.h:69 in 5fb70b5b72

65+    static inline std::pair<int64_t, uint64_t> Mul(int64_t a, int32_t b) noexcept
66+    {
67+        // On 64-bit MSVC, use _mul128 intrinsic for wide multiplication.
68+        std::pair<int64_t, uint64_t> ret;
69+        ret.second = _mul128(a, b, &ret.first);
70+        return ret;

hebasto commented at 10:45 am on June 4, 2024:

on MSVC one needs to use the _mul128 or _mulh intrinsics instead

From my research, it follows that __mulh seems more performant.

This code:

0        // On 64-bit MSVC, use __mulh intrinsic for wide multiplication.
1        return {__mulh(a, b), std::bit_cast<uint64_t>(a) * b};

gives the following numbers for the benchmark:

 0> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
 1
 2|               ns/op |                op/s |    err% |     total | benchmark
 3|--------------------:|--------------------:|--------:|----------:|:----------
 4|                1.41 |      707,119,862.93 |    2.6% |      0.17 | `FeefracMultipication`
 5> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
 6
 7|               ns/op |                op/s |    err% |     total | benchmark
 8|--------------------:|--------------------:|--------:|----------:|:----------
 9|                1.36 |      734,209,621.15 |    0.9% |      0.16 | `FeefracMultipication`
10> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
11
12|               ns/op |                op/s |    err% |     total | benchmark
13|--------------------:|--------------------:|--------:|----------:|:----------
14|                1.39 |      720,775,228.66 |    3.1% |      0.17 | `FeefracMultipication`

DrahtBot requested review from hebasto on Jun 4, 2024

sipa commented at 5:36 pm on June 4, 2024: member

@hebasto Would you mind benchmarking with smaller realistic numbers (say fee and size both between 0 and 2^30)? If the top limb is equal, it’s possible the naive code does worse.

hebasto commented at 12:19 pm on June 6, 2024: member

@sipa

Would you mind benchmarking with smaller realistic numbers (say fee and size both between 0 and 2^30)?

I switched to FastRandomContext::randbits(30).

If the top limb is equal, it’s possible the naive code does worse.

Numbers have not changed:

the master branch:

 0> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
 1
 2|               ns/op |                op/s |    err% |     total | benchmark
 3|--------------------:|--------------------:|--------:|----------:|:----------
 4|                1.47 |      682,496,667.82 |    1.1% |      0.17 | `FeefracMultipication`
 5> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
 6
 7|               ns/op |                op/s |    err% |     total | benchmark
 8|--------------------:|--------------------:|--------:|----------:|:----------
 9|                1.46 |      686,538,677.70 |    1.3% |      0.17 | `FeefracMultipication`
10> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
11
12|               ns/op |                op/s |    err% |     total | benchmark
13|--------------------:|--------------------:|--------:|----------:|:----------
14|                1.43 |      701,448,254.82 |    1.5% |      0.17 | `FeefracMultipication`

this PR:

 0> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
 1
 2|               ns/op |                op/s |    err% |     total | benchmark
 3|--------------------:|--------------------:|--------:|----------:|:----------
 4|                1.55 |      646,069,682.69 |    0.8% |      0.18 | `FeefracMultipication`
 5> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
 6
 7|               ns/op |                op/s |    err% |     total | benchmark
 8|--------------------:|--------------------:|--------:|----------:|:----------
 9|                1.52 |      656,962,444.61 |    0.6% |      0.18 | `FeefracMultipication`
10> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
11
12|               ns/op |                op/s |    err% |     total | benchmark
13|--------------------:|--------------------:|--------:|----------:|:----------
14|                1.53 |      653,076,477.59 |    0.9% |      0.18 | `FeefracMultipication`

sipa commented at 1:07 am on June 12, 2024: member

@hebasto Thanks. I was assuming this would be an obvious improvement, but if it isn’t, it’ll need some more investigation into what this is all compiled to. That’s not something I’m interested in doing for a half-supported architecture.

sipa closed this on Jun 12, 2024

bitcoin locked this on Jun 12, 2025

feefrac: 128-bit multiply support in MSVC #29758

Code Coverage

Reviews