Disable restrict keyword in GCC/Clang #586

pull sipa wants to merge 1 commits into bitcoin-core:master from sipa:201901_norestrict changing 1 files +8 −8
  1. sipa commented at 6:46 pm on January 18, 2019: contributor
    This fixes #585.
  2. TheBlueMatt commented at 6:49 pm on January 18, 2019: contributor
    You can water down the comment a bit, I think. Its rather unlikely to affect secp, but its also unlikely to speed anything up, so just not worth it. utACK either way.
  3. real-or-random commented at 6:50 pm on January 18, 2019: contributor
    utACK
  4. Disable restrict keyword in GCC/Clang f4f0e08191
  5. sipa force-pushed on Jan 18, 2019
  6. sipa commented at 6:57 pm on January 18, 2019: contributor

    I very rudimentary benchmark shows it indeed doesn’t affect verification speed.

    I’ll do a more thorough benchmark later.

  7. jgarzik commented at 8:11 pm on January 18, 2019: none
    utACK
  8. sipa commented at 10:50 pm on February 4, 2019: contributor
    I’ve done some more benchmarking, and can’t observe a slowdown in verification (if there is one, it’s less than 0.2%).
  9. sipa commented at 10:42 pm on February 5, 2019: contributor

    Oops, no. That was with assembly code enabled, which obviously takes the restrictness into account, but isn’t affected by the restrict keyword.

    When I disable asm, verification goes from 277000 cycles to 282000 cycles on a Threadripper 2950X system, a 1.8% slowdown.

  10. real-or-random commented at 3:33 pm on February 6, 2019: contributor
    1.8% sounds reasonable enough to me for the added safety.
  11. gmaxwell commented at 0:44 am on February 15, 2019: contributor

    Meh: 1.8% is a rather large slowdown to take for a “maybe the compiler might screw up in ways that the tests don’t find”. Assuming the 1.8% number applies on ARM, that change is essentially 1-2 hours of additional sync time on a non-assumevalid initial sync on a small bitcoin node.

    Here is my reasoning: There are a significant number of optimizations in the codebase with 1% - 3% level improvements which could be excluded on the basis of “maybe they get miscompiled” or “maybe they have a bug”.

    If the tests (including run time self tests) aren’t adequate to catch all plausible miscompilation, we should improve them. This has an added benefit of also protecting the code against unforseen risks, something that avoiding long standing language features does not accomplish.

    Or, if 1% vs the level of risk that using a less common compiler annotation implies is really the trade-off we want to take we should probably remove all the distinct normalization cases (and replace all with the a single constant time full normalization), the R comparison in projective space optimization, x86_64 assembly (which is now not much faster than what the compiler can produce), effective affine, etc. Each of which has 1-2% performance impacts and arguably has a worse risk/performance trade-off than the use of restrict. Similarly, under that logic we should probably remove endomorphism and gmp support from the codebase (zero performance gain in the normal config, non-zero risk of introducing bugs in the normal configuration).

    I don’t actually think the above is a good idea: I think our principle before has been to include essentially any optimization– even relatively complex ones– which would make a measurable difference in a real system subject to the constraint that we could test it well enough to be reasonably confident that its safe. But since there is essentially no competing library for this curve which I’m aware of which is anywhere near as fast, I think a case could be made for simplifying the codebase until it is merely significantly faster than the next less-well-tested alternative. To the extent that safety is our primary goal we only need to be fast enough to ensure that users who mostly care about speed select this code over a less safe alternative. [also, the audience of embedded devices that frequently adopt obviously less safe code might be swayed by a smaller binary sizes…]

    But whatever we do, we should try to be intentional and consistent in our strategy so that we get the full benefit of it.

    I also think that a case could be made for doing this if it were shown that the slowdown didn’t happen on ARM (since that’s where a performance loss is especially critical), or as a temporary measure while tests were improved.

  12. real-or-random commented at 7:06 pm on February 15, 2019: contributor

    Each of which has 1-2% performance impacts and arguably has a worse risk/performance trade-off than the use of restrict.

    I don’t think that’s so clear. This library offers exceptional quality due to

    • careful programming,
    • careful review, and
    • good tests.

    Unfortunately, the first two bullets don’t help in case of miscompilation. Tests do help of course but tests rarely show the absence of bugs. Another concern is that compilers behave differently on different architectures, with different versions, different flags, etc, and it’s not clear if people run the tests every time after they compile the code.

    And sure, other code can be miscompiled, too. But without a specific bug in mind, there is no reason why a specific implementation (say projective space stuff) should be more prone to miscompilation than other, except that it uses more lines of code.

    In our case, we know that these specific compiler bugs just won’t happen if we disable restrict. The scope of the compiler bug is quite narrow and we know what to do to prevent it, and we know the slowdown.

    If gcc released a new flag -finline-experimental-1.8 with the description “speeds up your code by 1.8 % but we’ve seen cases of miscompilation”, would you enable it in secp if the tests pass? I wouldn’t.

  13. gmaxwell commented at 9:56 am on February 20, 2019: contributor

    Another concern is that compilers behave differently on different architectures, with different versions, different flags, etc, and it’s not clear if people run the tests every time after they compile the code.

    This is not a justification for this change. It is a justification for runtime self-tests. I agree that the concern that the tests aren’t getting run where they need to be run is a good one, but the answer to it isn’t not to make a particular tradeoff one way for one thing, the answer is to fix the underlying.

    Again, I still hold that this particular decision doesn’t carry unique risks compared to a dozen other choices, nor do I think a case has been made that it has a worse trade-off.

    And sure, other code can be miscompiled, too. But without a specific bug in mind, there is no reason why a specific implementation (say projective space stuff) should be more prone to miscompilation than other,

    If you really want, I can go and justify the other examples in light of prior miscomplation bugs specifically, but I think it would be a waste of time. Besides, “except that it uses more lines of code” is a perfectly fine justification on its own. (not more lines precisely but more total machine code with more total corner cases)

    If gcc released a new flag -finline-experimental-1.8 with the description “speeds up your code by 1.8 % but we’ve seen cases of miscompilation

    The “new” kind of confuses the issue– anything new also carries a higher degree of unknown unknowns–, but we certainly compile with optimizations and there is a long and illustrious history of compiler optimizations and miscompilation. The benefit is (presumably!) more than 1.8% but the risk, historically, is that there have been a tremendous number of miscomplations which were optimization linked.

    GCC’s bug tracker is filled with scads of reports of miscompilation both current and in recent history (some even reported by some of us in the past…). But no attention is being given to that, only to this one thing that someone noticed. This seems to me to be a reactionary response ungrounded in a systematic approach to managing the tradeoffs of software quality.

  14. gmaxwell commented at 10:09 am on February 20, 2019: contributor
    As an aside, the linked GCC misbehaviour only occurs with -O3 (the same applies to the similar behaviour in LLVM) … which fits nicely with my above comment about optimization. I’d find an argument to not use O3 in general to be more compelling…
  15. in src/util.h:97 in f4f0e08191
    100 # else
    101-#  define SECP256K1_RESTRICT restrict
    102+/* As both clang and GCC have known miscompilation bugs related to restrict, disable it for now.
    103+ * TODO: re-evaluate when this bugs are fixed, and enable restrict in known good compilers.
    104+ * See:
    105+ * - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87609
    


    DesWurstes commented at 5:20 pm on February 22, 2019:
    I think the referenced bugs give the impression that restrict is forbidden for those compilers because of those bugs, while the real reason is “it works one day, it fails the other day”. GCC has just, 8 hours before, fixed the bug in master. Upstream fix doesn’t mean everyone is using the latest compiler, as even GCC 8.3, released 2 hours ago, doesn’t have the patch.
  16. jonasnick commented at 6:10 pm on February 26, 2019: contributor
    Running this PR on an aarch64 (ARMv8) based ODROID C2 I get a 2.43% slowdown (452us vs 463us) in bench_verify. This is without using ASM which is not enabled by default on ARM architectures.
  17. gmaxwell commented at 0:03 am on March 7, 2019: contributor
    @jonasnick Can you try benchmarking on the same device with restrict but with -O2 instead of O3?
  18. jonasnick commented at 5:43 pm on March 7, 2019: contributor
    0O2:  min 450us / avg 450us / max 451us
    1O3:  min 452us / avg 452us / max 461us
    

    All the measurements were done after increasing the ecdsa_verify benchmark “count” from 10 to 100

  19. gmaxwell commented at 9:27 pm on March 9, 2019: contributor
    ^ If we want to be paranoid, I think Jonas’ results would support using O2 instead of O3, not only does it address the restrict concern in both GCC and Clang but also a long history of more bugs at O3.
  20. sipa commented at 11:02 pm on March 12, 2019: contributor

    I don’t think we ever (neither in this repo, or in the Bitcoin Core build) use -O3 by default?

    Scratch that, the default is -O3 apparently.

  21. gmaxwell commented at 6:20 am on March 13, 2019: contributor
    At one point earlier I almost responded with “this restrict issue only exists at O3, lets not worry about it” … then though “hm. wait a minute.”… O3 is also why asm vs non-asm doesn’t make that big a difference on x86_64 IIRC.
  22. real-or-random commented at 10:59 pm on January 7, 2020: contributor

    Fwiw, this is fixed in GCC 7, 8 ,9 (>=7.4.1, >= 8.3.1, >= 9.0) according to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87609.

    It’s not yet fixed in LLVM but patches have been proposed already.

  23. real-or-random commented at 7:57 pm on March 26, 2020: contributor

    As an aside, the linked GCC misbehaviour only occurs with -O3 (the same applies to the similar behaviour in LLVM) … which fits nicely with my above comment about optimization. I’d find an argument to not use O3 in general to be more compelling…

    Now that we switched to O2, I hoped that this issue had been resolved on the way. But apparently, the miscompilation of the example program from the GCC bug only disappears on GCC, where it’s fixed anyway in the most recent versions. Clang/LLVM does loop unrolling even with O2, so you additionally need -fno-unroll-loops (https://godbolt.org/z/urkkWD), which seems to have some performance impact at least if x86_64 asm is disabled (gmp does not make a big difference).

     0./configure --with-bignum=gmp --with-asm=x86_64 CC=clang CFLAGS=-fno-unroll-loops
     1ecdsa_verify: min 81.3us / avg 83.4us / max 85.6us
     2ecdsa_verify: min 81.9us / avg 82.9us / max 84.3us
     3
     4./configure --with-bignum=gmp --with-asm=x86_64 CC=clang CFLAGS= 
     5ecdsa_verify: min 81.4us / avg 83.7us / max 86.5us
     6ecdsa_verify: min 81.4us / avg 83.1us / max 87.0us
     7
     8./configure --with-bignum=no --with-asm=no CC=clang CFLAGS= 
     9ecdsa_verify: min 97.1us / avg 98.6us / max 100us
    10
    11./configure --with-bignum=no --with-asm=no CC=clang CFLAGS=-fno-unroll-loops
    12ecdsa_verify: min 95.9us / avg 101us / max 104us
    

    500000 runs each.

    I’m not a friend of tweaking flags too much but given that there are anyway performance differences between gcc and clang and we unrolling is disabled in gcc, I could indeed imagine adding -fno-unroll-loops if other can confirm my benchmarks.

  24. real-or-random closed this on Dec 20, 2022


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-01-24 02:15 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me