Add ASM optimizations for MuHash3072 #19181

pull fjahr wants to merge 1 commits into bitcoin:master from fjahr:csi-4-muhash-asm changing 1 files +47 −1
  1. fjahr commented at 8:22 pm on June 5, 2020: contributor
    Adds assembly optimizations for MuHash3072 which is used for the muhash calculation of the UTXO set hash.
  2. in src/crypto/muhash.cpp:154 in 722700ad36 outdated
    149+bool IsOverflow(const Num3072* d)
    150+{
    151+    for (int i = 1; i < Num3072::LIMBS - 1; ++i) {
    152+        if (d->limbs[i] != std::numeric_limits<Num3072::limb_type>::max()) return false;
    153+    }
    154+    if (d->limbs[0] <= std::numeric_limits<Num3072::limb_type>::max() - 1103717) return false;
    


    ysangkok commented at 8:56 pm on June 5, 2020:
    binding 1103717 to a name would make the code more readable, as the name documents the purpose of the number.

    fjahr commented at 10:05 pm on June 7, 2020:
    done, not sure if my naming is good though. But I added a comment that should be helpful.
  3. in src/crypto/muhash.cpp:84 in 722700ad36 outdated
    83+    c2 = 0; \
    84+    c1 = t >> Num3072::LIMB_SIZE; \
    85+    c0 = t; \
    86+}
    87+
    88+/* [c0,c1,c2] += n * [d0,d1,d2]. c2 is 0 initially */
    


    ysangkok commented at 8:58 pm on June 5, 2020:
    comment wrong, or at least inconsistent with other version

    sipa commented at 9:02 pm on June 5, 2020:
    The other one is wrong (c0 is incremented, but c2 is only assigned to, so it’s c2 that’s treated as 0 on input).

    fjahr commented at 10:04 pm on June 7, 2020:
    fixed
  4. DrahtBot added the label Build system on Jun 5, 2020
  5. DrahtBot added the label Tests on Jun 5, 2020
  6. DrahtBot added the label Utils/log/libs on Jun 5, 2020
  7. fjahr force-pushed on Jun 7, 2020
  8. DrahtBot commented at 11:49 pm on June 7, 2020: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK Sjors

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    No conflicts as of last run.

  9. Sjors commented at 10:48 am on June 9, 2020: member
    I’ll post some bench results here after #19214 and when the SHA512 -> 256 switch is implemented.
  10. laanwj commented at 2:29 pm on June 9, 2020: member
    Awesome, thanks for working on this. Given how good compilers are nowadays it’s somewhat surprising to me that there’s still something to gain with implementing things manually in assembly without use of any special instruction sets.
  11. fjahr force-pushed on Jun 11, 2020
  12. DrahtBot added the label Needs rebase on Jul 30, 2020
  13. Sjors commented at 12:18 pm on December 15, 2021: member
    Concept ACK. Description needs an update. And this still needs a rebase.
  14. fjahr force-pushed on Dec 19, 2021
  15. fjahr commented at 5:52 pm on December 19, 2021: contributor

    Thanks for the nudge @Sjors and @DrahtBot !

    I have only done a plain rebase so far, i.e. I have not checked whether further optimizations may be interesting to look into. But while I ran the benchmarks again I noticed that it currently appears that this has slower benchmarks than master on my machine. mul and div operations appear to be almost 20% slower. I will try to test on some other machines soon but if someone else can give it a spin as well it would be a great help!

    0$ src/bench/bench_bitcoin -filter=MuHash.* -min_time=1000
    1
    2|               ns/op |                op/s |    err% |     total | benchmark
    3|--------------------:|--------------------:|--------:|----------:|:----------
    4|            7,184.06 |          139,197.00 |    0.8% |      1.07 | `MuHash`
    5|            6,138.95 |          162,894.31 |    0.1% |      1.05 | `MuHashDiv`
    6|            6,161.80 |          162,290.16 |    0.7% |      1.05 | `MuHashMul`
    7|            1,016.96 |          983,326.67 |    0.3% |      1.08 | `MuHashPrecompute`
    
    0$ src/bench/bench_bitcoin -filter=MuHash.* -min_time=1000
    1
    2|               ns/op |                op/s |    err% |     total | benchmark
    3|--------------------:|--------------------:|--------:|----------:|:----------
    4|            6,225.79 |          160,622.20 |    1.6% |      1.05 | `MuHash`
    5|            5,121.86 |          195,241.46 |    0.5% |      1.09 | `MuHashDiv`
    6|            5,173.32 |          193,299.36 |    1.5% |      1.11 | `MuHashMul`
    7|            1,023.41 |          977,127.42 |    0.7% |      1.08 | `MuHashPrecompute`
    
  16. DrahtBot removed the label Needs rebase on Dec 19, 2021
  17. maflcko removed the label Build system on Dec 20, 2021
  18. maflcko removed the label Tests on Dec 20, 2021
  19. maflcko removed the label Utils/log/libs on Dec 20, 2021
  20. DrahtBot added the label Utils/log/libs on Dec 20, 2021
  21. maflcko commented at 5:00 pm on December 20, 2021: member

    With gcc 11.2.0 on Cortex-A72.

    This:

     0src/bench/bench_bitcoin -filter=MuHash.* 
     1Warning, results might be unstable:
     2* CPU frequency scaling enabled: CPU 0 between 600.0 and 1,500.0 MHz
     3* CPU governor is 'ondemand' but should be 'performance'
     4* Turbo is enabled, CPU frequency will fluctuate
     5
     6Recommendations
     7* Use 'pyperf system tune' before benchmarking. See https://github.com/psf/pyperf
     8
     9|               ns/op |                op/s |    err% |     total | benchmark
    10|--------------------:|--------------------:|--------:|----------:|:----------
    11|           27,412.56 |           36,479.63 |    0.1% |      0.01 | `MuHash`
    12|           24,505.61 |           40,806.99 |    0.0% |      0.01 | `MuHashDiv`
    13|           24,513.39 |           40,794.02 |    0.0% |      0.01 | `MuHashMul`
    14|            2,874.73 |          347,859.09 |    0.0% |      0.01 | `MuHashPrecompute`
    

    Master:

     0src/bench/bench_bitcoin -filter=MuHash.* 
     1Warning, results might be unstable:
     2* CPU frequency scaling enabled: CPU 0 between 600.0 and 1,500.0 MHz
     3* CPU governor is 'ondemand' but should be 'performance'
     4* Turbo is enabled, CPU frequency will fluctuate
     5
     6Recommendations
     7* Use 'pyperf system tune' before benchmarking. See https://github.com/psf/pyperf
     8
     9|               ns/op |                op/s |    err% |     total | benchmark
    10|--------------------:|--------------------:|--------:|----------:|:----------
    11|           27,451.77 |           36,427.52 |    0.0% |      0.01 | `MuHash`
    12|           24,524.13 |           40,776.16 |    0.1% |      0.01 | `MuHashDiv`
    13|           24,494.58 |           40,825.36 |    0.0% |      0.01 | `MuHashMul`
    14|            2,853.94 |          350,392.61 |    0.0% |      0.01 | `MuHashPrecompute`
    
  22. sipa commented at 5:04 pm on December 20, 2021: member
    @MarcoFalke Well, that’s expected at least. There are only x86_64 asm implementations here. @fjahr What kind of system/compiler/flags?
  23. sipa commented at 5:11 pm on December 20, 2021: member

    Benchmark on Ryzen 5950X, Ubuntu 21.10, GCC 11.2.0, default configure flags.

    master:

    ns/op op/s err% total benchmark
    4,133.94 241,900.24 0.5% 0.01 MuHash
    3,530.11 283,277.63 0.4% 0.01 MuHashDiv
    3,522.44 283,894.43 0.2% 0.01 MuHashMul
    554.91 1,802,107.77 0.1% 0.01 MuHashPrecompute

    This PR:

    ns/op op/s err% total benchmark
    3,157.98 316,658.14 0.3% 0.01 MuHash
    2,597.48 384,987.98 0.1% 0.01 MuHashDiv
    2,597.10 385,044.86 0.2% 0.01 MuHashMul
    556.71 1,796,271.37 0.0% 0.01 MuHashPrecompute
  24. maflcko commented at 9:21 am on December 21, 2021: member

    @MarcoFalke Well, that’s expected at least. There are only x86_64 asm implementations here.

    I wish I could read asm. :sweat_smile:

  25. fjahr commented at 9:43 pm on December 21, 2021: contributor

    @fjahr What kind of system/compiler/flags?

    Hmm, this was with Intel Core i5-6287U, darwin 21.2.0 (macOS 12.1), clang 13.0.0, default configure flags except for skipping the GUI.

  26. sipa commented at 10:13 pm on December 21, 2021: member
    @fjahr Did you disable frequency scaling? Mobile CPUs by default will change frequency all the time in response to load, leading to unreliable benchmarks.
  27. sipa commented at 10:31 pm on December 21, 2021: member

    @MarcoFalke

    I wish I could read asm. :sweat_smile:

    The #if defined(__amd64__) || defined(__x86_64__) also helps.

  28. in src/crypto/muhash.cpp:73 in 915ef08809 outdated
    68+    __asm__ ("imul %1,%0,%0" : "+r"(c1) : "i"(n) : "cc"); \
    69+    __asm__ ("addq %1,%0" : "+r"(c1) : "g"(th) : "cc"); \
    70+}
    71+
    72+/** [c0,c1] += a */
    73+#define add2(c0,c1,a) { \
    


    fanquake commented at 2:58 am on December 22, 2021:
    Looks like this is unused.

    fjahr commented at 11:36 pm on November 13, 2022:
    jepp, I removed it
  29. fjahr commented at 6:59 pm on December 26, 2021: contributor

    @fjahr Did you disable frequency scaling? Mobile CPUs by default will change frequency all the time in response to load, leading to unreliable benchmarks.

    I didn’t but I now have and the results are consistent. And previously I had also been running them about 5 times on different days already. I also compiled with clang 9 and tried again because that is my best guess of what I was running the last time I worked on this and still got the same results. 🤷‍♂️

  30. achow101 commented at 6:10 pm on October 12, 2022: member
    Are you still working on this?
  31. achow101 commented at 5:11 pm on November 10, 2022: member
    Closing this as it has not had any activity in a while. If you are interested in continuing work on this, please leave a comment so that it can be reopened.
  32. achow101 closed this on Nov 10, 2022

  33. kristapsk commented at 8:10 pm on November 10, 2022: contributor
    This seems to be just a matter of single unadressed review comment and more benchmarking. I could tak over if @fjahr is not interested anymore.
  34. fjahr commented at 11:36 pm on November 13, 2022: contributor

    This seems to be just a matter of single unadressed review comment and more benchmarking. I could tak over if @fjahr is not interested anymore.

    Thanks, but I am still happy to keep this alive. I will try to run benchmarks again myself, I am on a new machine yet again now. Of course more benchmarks help, so if you could run them as well that would be great @kristapsk .

    I addressed the comment from @fanquake and rebased the branch. @achow101 please reopen. I was a bit busy and forgot to respond here but I will try to not let it happen in the future :)

  35. achow101 commented at 0:03 am on November 14, 2022: member
    Since you’ve force pushed to the branch, GitHub won’t allow this to be reopened. I think if you push the commit that was the HEAD at the time of closing, it can be reopened. Or you can make a new PR.
  36. fjahr commented at 11:15 pm on November 14, 2022: contributor

    Since you’ve force pushed to the branch, GitHub won’t allow this to be reopened. I think if you push the commit that was the HEAD at the time of closing, it can be reopened. Or you can make a new PR.

    Alright, I pushed the old commit back to the remote branch.

  37. Add x86_64 assembly optimization for MuHash 82d4c7e3a8
  38. sipa reopened this on Nov 14, 2022

  39. sipa commented at 11:20 pm on November 14, 2022: member
    Reopened.
  40. aureleoules commented at 12:56 pm on November 15, 2022: member

    I rebased the PR on master and used default flags.

    NixOS 22.11pre420394.0e6df35f396 (Raccoon) x86_64 Intel(R) Core(TM) i5-8300H CPU @ 2.30GHz

    PR

    ns/op op/s err% ins/op cyc/op IPC bra/op miss% total benchmark
    16,250.77 61,535.55 0.5% 65,036.30 37,190.77 1.749 4,988.05 2.4% 0.01 MuHash
    13,684.71 73,074.28 0.3% 55,013.25 31,335.53 1.756 4,877.04 2.4% 0.01 MuHashDiv
    13,660.24 73,205.18 0.1% 55,013.25 31,276.80 1.759 4,877.04 2.4% 0.01 MuHashMul
    2,460.28 406,457.92 0.9% 10,020.05 5,625.06 1.781 111.01 0.1% 0.01 MuHashPrecompute

    Master

    ns/op op/s err% ins/op cyc/op IPC bra/op miss% total benchmark
    9,139.01 109,421.06 2.0% 55,144.17 20,963.14 2.631 4,986.03 2.6% 0.01 MuHash
    7,612.01 131,371.35 1.4% 45,121.14 17,450.67 2.586 4,875.02 2.6% 0.01 MuHashDiv
    7,521.41 132,953.88 0.3% 45,121.14 17,237.41 2.618 4,875.02 2.6% 0.01 MuHashMul
    1,442.51 693,236.58 0.1% 10,020.03 3,311.93 3.025 111.00 0.0% 0.01 MuHashPrecompute

    PR

    ns/op op/s err% ins/op cyc/op IPC bra/op miss% total benchmark
    9,940.04 100,603.22 0.1% 60,910.77 22,829.39 2.668 5,032.55 2.9% 0.01 MuHash
    8,491.45 117,765.58 0.2% 51,654.70 19,498.47 2.649 4,927.54 2.9% 0.01 MuHashDiv
    8,538.40 117,117.94 0.6% 51,653.69 19,531.16 2.645 4,927.53 2.9% 0.01 MuHashMul
    1,434.24 697,232.13 0.4% 9,246.03 3,288.00 2.812 105.00 0.0% 0.01 MuHashPrecompute

    Master

    ns/op op/s err% ins/op cyc/op IPC bra/op miss% total benchmark
    12,916.93 77,417.80 2.0% 95,142.80 29,627.26 3.211 5,032.55 2.6% 0.01 MuHash
    11,196.36 89,314.76 0.2% 85,886.76 25,660.90 3.347 4,927.55 2.5% 0.01 MuHashDiv
    11,266.59 88,757.98 0.8% 85,885.78 25,854.14 3.322 4,927.55 2.6% 0.01 MuHashMul
    1,452.53 688,453.46 1.2% 9,246.03 3,327.95 2.778 105.00 0.1% 0.01 MuHashPrecompute

    Ubuntu 22.04.1 LTS x86_64 Intel Xeon (8) @ 2.199GHz

    PR

    ns/op op/s err% total benchmark
    7,507.37 133,202.45 0.2% 0.01 MuHash
    6,249.94 160,001.54 0.5% 0.01 MuHashDiv
    6,215.10 160,898.37 0.4% 0.01 MuHashMul
    1,221.70 818,533.28 0.6% 0.01 MuHashPrecompute

    Master

    ns/op op/s err% total benchmark
    9,625.88 103,886.60 0.5% 0.01 MuHash
    8,307.33 120,375.61 0.5% 0.01 MuHashDiv
    8,309.12 120,349.69 0.4% 0.01 MuHashMul
    1,225.06 816,285.56 0.5% 0.01 MuHashPrecompute

    PR

    ns/op op/s err% total benchmark
    9,330.92 107,170.56 0.7% 0.01 MuHash
    8,061.01 124,053.97 0.3% 0.01 MuHashDiv
    8,083.18 123,713.72 0.2% 0.01 MuHashMul
    1,228.53 813,978.78 0.1% 0.01 MuHashPrecompute

    Master

    ns/op op/s err% total benchmark
    7,827.87 127,748.62 0.4% 0.01 MuHash
    6,550.98 152,648.92 0.2% 0.01 MuHashDiv
    6,591.85 151,702.45 0.7% 0.01 MuHashMul
    1,229.03 813,647.89 0.3% 0.01 MuHashPrecompute

    I seem to get consistently better results than master on GCC but also consistently worst results on clang.

  41. fjahr force-pushed on Nov 20, 2022
  42. theStack commented at 3:37 pm on November 29, 2022: contributor

    Results using gcc 11.3.0 and default configuration on a AMD EPYC 7702P (running Ubuntu 22.04):

    master:

    ns/op op/s err% total benchmark
    6,810.64 146,829.16 2.5% 0.01 MuHash
    5,947.59 168,135.40 1.0% 0.01 MuHashDiv
    5,756.02 173,731.09 0.8% 0.01 MuHashMul
    838.00 1,193,321.69 0.4% 0.01 MuHashPrecompute

    PR:

    ns/op op/s err% total benchmark
    5,457.87 183,221.54 0.4% 0.01 MuHash
    4,605.19 217,146.38 0.9% 0.01 MuHashDiv
    4,615.33 216,669.07 0.7% 0.01 MuHashMul
    848.51 1,178,540.85 0.7% 0.01 MuHashPrecompute

    For a more practical test, I ran gettxoutset muhash without coinstatsindex on mainnet height 765184 both on master and the PR (started with -nolisten -noconnect to avoid distractions for the benchmark), showing a nice ~18% speedup:

    master:

     0$ time ./src/bitcoin-cli gettxoutsetinfo muhash
     1{
     2  "height": 765184,
     3  "bestblock": "00000000000000000006923ee26b9b3d271035b5cdf79f4915d8453cb3a6f305",
     4  "txouts": 83195150,
     5  "bogosize": 6203273828,
     6  "muhash": "b645663cd8a7a4b6083a84940199f17125232ab4b126602ed2aa054844503393",
     7  "total_amount": 19219692.16624067,
     8  "transactions": 49805302,
     9  "disk_size": 5992277808
    10}
    11
    12real    6m28.066s
    13user    0m0.003s
    14sys     0m0.001s
    

    PR:

     0$ time ./src/bitcoin-cli gettxoutsetinfo muhash
     1
     2
     3{
     4  "height": 765184,
     5  "bestblock": "00000000000000000006923ee26b9b3d271035b5cdf79f4915d8453cb3a6f305",
     6  "txouts": 83195150,
     7  "bogosize": 6203273828,
     8  "muhash": "b645663cd8a7a4b6083a84940199f17125232ab4b126602ed2aa054844503393",
     9  "total_amount": 19219692.16624067,
    10  "transactions": 49805302,
    11  "disk_size": 5991699018
    12}
    13
    14real    5m28.506s
    15user    0m0.003s
    16sys     0m0.000s
    

    I seem to get consistently better results than master on GCC but also consistently worst results on clang.

    Interesting, will als repeat the above test runs using clang in a bit to see if I can observe the same effect.

  43. theStack commented at 3:08 pm on December 1, 2022: contributor

    I seem to get consistently better results than master on GCC but also consistently worst results on clang.

    I can confirm that the MuHash performance with this PR is worse than on master when compiling using clang 14.0.0 (default configuration on a AMD EPYC 7702P, running Ubuntu 22.04). Benchmark results:

    master:

    ns/op op/s err% total benchmark
    5,476.23 182,607.35 1.8% 0.01 MuHash
    4,570.74 218,782.88 1.0% 0.01 MuHashDiv
    4,547.88 219,882.44 0.9% 0.01 MuHashMul
    814.77 1,227,343.46 0.8% 0.01 MuHashPrecompute

    PR:

    ns/op op/s err% total benchmark
    5,799.24 172,436.30 0.6% 0.01 MuHash
    5,022.26 199,113.39 1.3% 0.01 MuHashDiv
    4,954.25 201,847.07 0.3% 0.01 MuHashMul
    828.04 1,207,671.75 0.8% 0.01 MuHashPrecompute

    And the gettxoutsetinfo muhash tests for mainnet block 765184:

    master:

     0$ time ./src/bitcoin-cli gettxoutsetinfo muhash
     1{
     2  "height": 765184,
     3  "bestblock": "00000000000000000006923ee26b9b3d271035b5cdf79f4915d8453cb3a6f305",
     4  "txouts": 83195150,
     5  "bogosize": 6203273828,
     6  "muhash": "b645663cd8a7a4b6083a84940199f17125232ab4b126602ed2aa054844503393",
     7  "total_amount": 19219692.16624067,
     8  "transactions": 49805302,
     9  "disk_size": 5991686963
    10}
    11
    12real    5m36.934s
    13user    0m0.002s
    14sys     0m0.002s
    

    PR:

     0$ time ./src/bitcoin-cli gettxoutsetinfo muhash
     1{
     2  "height": 765184,
     3  "bestblock": "00000000000000000006923ee26b9b3d271035b5cdf79f4915d8453cb3a6f305",
     4  "txouts": 83195150,
     5  "bogosize": 6203273828,
     6  "muhash": "b645663cd8a7a4b6083a84940199f17125232ab4b126602ed2aa054844503393",
     7  "total_amount": 19219692.16624067,
     8  "transactions": 49805302,
     9  "disk_size": 5991686963
    10}
    11
    12real    5m52.723s
    13user    0m0.004s
    14sys     0m0.000s
    

    (Sorry for the long double-posts, I should summarize both on them on a small summary table)

    Based on those results, if at all, those optimizations should probably only be enabled if GCC is used? (Not sure if it’s worth the review and maintenance burden though.)

  44. achow101 commented at 3:37 pm on April 25, 2023: member
  45. real-or-random commented at 4:13 pm on April 25, 2023: contributor

    Based on those results, if at all, those optimizations should probably only be enabled if GCC is used? (Not sure if it’s worth the review and maintenance burden though.)

    Some of the code changes here conflict with #21590, and given that clang 14 is better than this ASM, we should maybe get benchmarks on a recent GCC (like 12.2) and see if it’s faster. If not, I also wonder if there’s a better way to convince GCC to output good code. @theStack Can you re-benchmark with GCC 12 ? I can also try to get some numbers on my machine.

  46. maflcko commented at 2:19 pm on April 26, 2023: member
    Maybe even with gcc 13.1, now that it is about to be released?
  47. fjahr commented at 7:29 pm on May 1, 2023: contributor

    I finally got another amd64 machine with which I can test again. I confirmed the clang-14 results others have seen and I tested with GCC 13.1. The results still look good there for me but would be great if someone else could confirm.

    (using src/bench/bench_bitcoin -filter=MuHash.* -min-time=1000)

    Master:

    0|               ns/op |                op/s |    err% |     total | benchmark
    1|--------------------:|--------------------:|--------:|----------:|:----------
    2|            9,000.29 |          111,107.54 |    1.6% |      1.09 | `MuHash`
    3|            7,424.35 |          134,691.90 |    0.0% |      1.05 | `MuHashDiv`
    4|            7,428.86 |          134,610.13 |    0.2% |      1.06 | `MuHashMul`
    5|            1,058.97 |          944,314.00 |    0.0% |      1.08 | `MuHashPrecompute`
    

    This PR:

    0|               ns/op |                op/s |    err% |     total | benchmark
    1|--------------------:|--------------------:|--------:|----------:|:----------
    2|            6,936.64 |          144,161.92 |    0.1% |      1.08 | `MuHash`
    3|            5,596.30 |          178,689.37 |    0.2% |      1.03 | `MuHashDiv`
    4|            5,422.52 |          184,415.94 |    0.7% |      1.08 | `MuHashMul`
    5|            1,061.63 |          941,950.39 |    0.0% |      1.09 | `MuHashPrecompute`
    

    I guess it makes sense to only use this when GCC is used which should work with this

    0#if (defined(__amd64__) || defined(__x86_64__)) && defined(__GNUC__) && !defined(__clang__)
    

    but I would also be curious if there is another way to tell GCC to do the same optimizations clang seems to be doing.

    I will also try to combine the changes here with #21590 to see what the combined impact is.

  48. real-or-random commented at 8:21 am on May 2, 2023: contributor

    #if (defined(__amd64__) || defined(__x86_64__)) && defined(__GNUC__) && !defined(__clang__)

    If it’s that niche, it’s a bit unclear to me whether it’s worth the hassle. I feel we should look at #21590 first. I expect this to be a much larger improvement (and since it’s algorithmic, it will apply to all targets). Perhaps we don’t care about this optimization here so much after #21590.

  49. fjahr commented at 2:10 pm on May 2, 2023: contributor

    I have now tested SafeGCD vs SafeGCD+ASM (see https://github.com/fjahr/bitcoin/tree/pr21590-safegcd-asm) and the gains from including the ASM code are still substantial.

    GCC 13.1 - SafeGCD only:

    0|               ns/op |                op/s |    err% |     total | benchmark
    1|--------------------:|--------------------:|--------:|----------:|:----------
    2|            9,168.73 |          109,066.35 |    2.1% |      1.06 | `MuHash`
    3|            7,571.75 |          132,069.88 |    0.2% |      1.05 | `MuHashDiv`
    4|           75,079.98 |           13,319.13 |    0.0% |      1.06 | `MuHashFinalize`
    5|            7,322.87 |          136,558.39 |    0.1% |      1.05 | `MuHashMul`
    6|            1,052.74 |          949,904.92 |    0.0% |      1.09 | `MuHashPrecompute`
    

    GCC 13.1 - SafeGCD + ASM:

    0|               ns/op |                op/s |    err% |     total | benchmark
    1|--------------------:|--------------------:|--------:|----------:|:----------
    2|            6,924.98 |          144,404.76 |    0.2% |      1.08 | `MuHash`
    3|            5,631.55 |          177,571.08 |    0.1% |      1.09 | `MuHashDiv`
    4|           70,266.78 |           14,231.48 |    0.8% |      1.05 | `MuHashFinalize`
    5|            5,454.11 |          183,347.95 |    0.1% |      1.09 | `MuHashMul`
    6|            1,051.74 |          950,801.50 |    0.0% |      1.09 | `MuHashPrecompute`
    

    @real-or-random while this change here is niche in terms of the tooling+architecture it targets, the SafeGCD change only impacts the finalize operation and that has a much smaller impact on the overall computation time because of how we use MuHash in practice. When a new block comes in we do a div of all the spent TXOs and a mul of all the new UTXOs and then at the end we finalize once. So I still have a hard time deciding which one is the more valuable change. Also, while ASM might be a bit intimidating, #21590 has 10x the LOC changed and requires some understanding of the underlying theory.

  50. real-or-random commented at 2:17 pm on May 2, 2023: contributor

    the SafeGCD change only impacts the finalize operation and that has a much smaller impact on the overall computation time because of how we use MuHash in practice. When a new block comes in we do a div of all the spent TXOs and a mul of all the new UTXOs and then at the end we finalize once.

    Okay, thanks for the numbers. I agree then, we shouldn’t neglect this one.

    So I still have a hard time deciding which one is the more valuable change. Also, while ASM might be a bit intimidating, #21590 has 10x the LOC changed and requires some understanding of the underlying theory.

    Yeah, they both have their merits then, and I don’t think any should be prioritized over the other. Let’s do (first) whatever PR we get the ACKs on. (And yes, I expect #21590 to be harder to review…)

    but I would also be curious if there is another way to tell GCC to do the same optimizations clang seems to be doing.

    That would be ideal. I’ll try to have a look at this.

  51. real-or-random commented at 4:05 pm on May 2, 2023: contributor

    but I would also be curious if there is another way to tell GCC to do the same optimizations clang seems to be doing.

    I played around with this a bit, and I don’t see any obvious trick to make that work. If someone else wants to give it a try, https://gcc.godbolt.org/z/hhGfeEoKq could be a nice starting point.

  52. DrahtBot commented at 9:16 am on August 8, 2023: contributor

    There hasn’t been much activity lately. What is the status here?

    Finding reviewers may take time. However, if the patch is no longer relevant, please close this pull request. If the author lost interest or time to work on this, please close it and mark it ‘Up for grabs’ with the label, so that it can be picked up in the future.

  53. fjahr commented at 3:02 pm on August 8, 2023: contributor

    There hasn’t been much activity lately. What is the status here?

    Finding reviewers may take time. However, if the patch is no longer relevant, please close this pull request. If the author lost interest or time to work on this, please close it and mark it ‘Up for grabs’ with the label, so that it can be picked up in the future.

    Still relevant… How good is your ASM, @DrahtBot ? 😁

  54. achow101 requested review from theStack on Sep 20, 2023
  55. achow101 requested review from sipa on Sep 20, 2023
  56. sipa commented at 2:31 pm on September 29, 2023: member
    It’s a rather unsatisfying situation that a compiler produces better code than hand-written assembly. One possibility is just taking the asm generated by clang 14 and including that as asm blocks in the C++ code?
  57. real-or-random commented at 2:44 pm on September 29, 2023: contributor

    It’s a rather unsatisfying situation that a compiler produces better code than hand-written assembly. One possibility is just taking the asm generated by clang 14 and including that as asm blocks in the C++ code?

    Yes, but what would be a proper way of reviewing such a PR? Just comparing with the clang output? If we think that’s sufficient, then that’s a possible way forward.

  58. real-or-random commented at 11:30 am on November 8, 2023: contributor

    There’s a lot of activity recently in GCC towards generating adc instructions on x86(_64): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173 But the current GCC trunk (14) so far hasn’t improved over GCC 13 for over code. The GCC trhead also has some hints at possible other implementations such as GCC builtins, which may be simpler to review.

    I don’t know. Checking the asm annotations is not trivial and not many people are familiar with inline asm. But on the other hand, I’m not saying that this PR is a huge review effort. It’s just a handful of small functions, and someone needs to take the time.

  59. DrahtBot added the label CI failed on Nov 27, 2023
  60. DrahtBot removed the label CI failed on Dec 10, 2023
  61. DrahtBot added the label CI failed on Jan 24, 2024
  62. achow101 commented at 2:53 pm on April 9, 2024: member

    One possibility is just taking the asm generated by clang 14 and including that as asm blocks in the C++ code?

    Seems like we should do that.

  63. achow101 commented at 2:54 pm on April 9, 2024: member
    Closing as up for grabs due to lack of activity.
  64. achow101 closed this on Apr 9, 2024

  65. achow101 added the label Up for grabs on Apr 9, 2024
  66. laanwj commented at 3:02 pm on April 9, 2024: member

    One possibility is just taking the asm generated by clang 14 and including that as asm blocks in the C++ code?

    In general i really dislike the idea of copy/pasting assembly output from a compiler into the source code. It’s already hard enough to review human-generated asm code but at least you can ask the author about the reasoning how and why. In the case of compiler output, well, it’d be a matter of waiting for gcc to improve :slightly_smiling_face:

  67. maflcko removed the label Up for grabs on Apr 9, 2024
  68. maflcko commented at 3:06 pm on April 9, 2024: member
    Removing “up for grabs” for now. I don’t think anyone will review asm, regardless of where it came from? If there are reviewers who would review it, they should speak up first, no?
  69. laanwj commented at 3:18 pm on April 9, 2024: member
    To be clear, I’m happy to review asm if there’s 1) a very clear performance win in an important part of the code 2) it’s human-written and well commented 3) it’s only small and relatively straightforward, self-contained operations. With how good compilers are nowadays it should be rare, though. With new instruction sets it’s generally better (also for review-easier to check data flow) to use intrinsics instead of direct inline assembly.

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-17 06:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me