Revert merge of PR #31826 #31908

pull darosior wants to merge 1 commits into bitcoin:master from darosior:2502_revert_pr_31817 changing 1 files +11 −34
  1. darosior commented at 3:33 pm on February 19, 2025: member

    PR #31826 was merged despite the code not compiling.

    #31902 was opened to fix the code but since this code is only targeting a not officially supported platform, we don’t have a CI in place to compile and run tests on this platform, neither apparently reviewers do (nor does the author?), don’t take more risk right before 29 and revert the original broken PR.

  2. Revert "Merge bitcoin/bitcoin#31826: random: Check `GetRNDRRS` is supported in `InitHardwareRand` to avoid infinite loop"
    This reverts commit 139640079ff52de5a7935e98225da76686ef32cb, reversing
    changes made to dc3a71463373e07e1c8956173c039e61fecbf029.
    3e9b12b3e0
  3. DrahtBot commented at 3:33 pm on February 19, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31908.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK sipa, TheCharlatan, eval-exec, laanwj, achow101

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  4. glozow added this to the milestone 29.0 on Feb 19, 2025
  5. glozow commented at 3:51 pm on February 19, 2025: member
    Up to reviewers to decide whether this or #31902 is more appropriate, but added to milestone so we do this before branch off.
  6. sipa commented at 4:01 pm on February 19, 2025: member

    ACK 3e9b12b3e0f039a8760410afed74c7e4d15afbe6

    We shouldn’t have merged #31826 if it didn’t compile, and apparently nobody actually tested it. It’s one thing to make changes for an unsupported platform that don’t hurt others, but clearly the bar for testing was too low if it actually doesn’t compile at all.

  7. dergoegge commented at 4:05 pm on February 19, 2025: member

    neither apparently reviewers do (nor does the author?)

    It was actually tested on the merged PR: see #31826 (comment) and #31826 (comment) but a later refactoring suggestion (https://github.com/bitcoin/bitcoin/pull/31826#discussion_r1955192539) introduced the compilation issue and no re-testing was done before merge.

    I think if @laanwj and @giahuy98 are willing to re-test on #31902 we could consider merging the fix instead.

  8. TheCharlatan approved
  9. TheCharlatan commented at 4:06 pm on February 19, 2025: contributor
    ACK 3e9b12b3e0f039a8760410afed74c7e4d15afbe6
  10. eval-exec commented at 4:09 pm on February 19, 2025: contributor
  11. glozow requested review from laanwj on Feb 19, 2025
  12. laanwj commented at 4:51 pm on February 19, 2025: member

    We shouldn’t have merged #31826 if it didn’t compile, and apparently nobody actually tested it. It’s one thing to make changes for an unsupported platform that don’t hurt others, but clearly the bar for testing was too low if it actually doesn’t compile at all.

    i did actually test it extensively, but yes it was changed and i hadn’t gotten around to reviewing or re-testing it yet.

  13. laanwj approved
  14. laanwj commented at 4:57 pm on February 19, 2025: member
    ACK 3e9b12b3e0f039a8760410afed74c7e4d15afbe6 i have checked that this is an exact revert of 09b150bb8adae00854f02ece69fc6ef222fb07d9
  15. achow101 commented at 5:20 pm on February 19, 2025: member

    since this code is only targeting a not officially supported platform

    Ostensibly we do though, as the code is part of the aarch64 platform which we do have CI for. However, it appears that our ARM CI does not have HWCAP2_RNG, but it could if we switched to using AWS Graviton instances.

  16. laanwj commented at 5:30 pm on February 19, 2025: member

    Ostensibly we do though, as the code is part of the aarch64 platform which we do have CI for. However, it appears that our ARM CI does not have HWCAP2_RNG, but it could if we switched to using AWS Graviton instances.

    Having the ARM64 RNG code in the first place makes sense, and we could test this on Graviton instances.

    But also agree that a workaround for an errata of specific SoC series is really on the edge of what is in scope for this project. Or user code in the first place, for that matter. Ideally a broken implementation of a feature is fixed in the kernel. Either by disabling the functionality on that specific SoC or another workaround like a microcode update or a hook for the instruction.

  17. achow101 commented at 6:01 pm on February 19, 2025: member
    ACK 3e9b12b3e0f039a8760410afed74c7e4d15afbe6
  18. achow101 merged this on Feb 19, 2025
  19. achow101 closed this on Feb 19, 2025

  20. jonatack commented at 6:11 pm on February 19, 2025: member

    (Why my approve button can’t click?

    I believe it was disabled for non-members of the repository because we were seeing too much spamming from that functionality.

  21. maflcko commented at 10:21 pm on February 19, 2025: member

    Ostensibly we do though, as the code is part of the aarch64 platform which we do have CI for. However, it appears that our ARM CI does not have HWCAP2_RNG, but it could if we switched to using AWS Graviton instances.

    Having the ARM64 RNG code in the first place makes sense, and we could test this on Graviton instances.

    I don’t think they have a 32-bit mode, so we’d have to drop that:

    https://github.com/bitcoin/bitcoin/blob/e606c577cb257d0927ca6ea939aab06cf0639aad/ci/test/00_setup_env_arm.sh#L9

    So if something is added, it would be a new task?

    Moreover, there are also GHA runners with HWCAP2_RNG set, but they are in beta and sometimes fail to start.

  22. hugohn commented at 8:28 am on February 20, 2025: contributor

    Just wanted to add that even though we discovered the issue on mobile devices (Samsung S25 and OnePlus 13), it looks like the same chip will be shipped in several laptop brands/models as well: https://www.google.com/search?q=laptop+snapdragon+x+elite

    My colleague @giahuy98 did test the original fix (https://github.com/bitcoin/bitcoin/pull/31826) on a real S25 device (it works), and @laanwj tested as well. Unfortunately, after that, there were further code changes (code refactoring) that did not get tested.

  23. laanwj commented at 9:42 am on February 20, 2025: member

    I don’t think they have a 32-bit mode, so we’d have to drop that:

    They seem to have Aarch32 support! To verify this i spun up a m7g (Graviton 3) instance, running Ubuntu:

    0$ uname -a
    1Linux ip-172-31-13-55 6.8.0-1021-aws [#23](/bitcoin-bitcoin/23/)-Ubuntu SMP Mon Dec  9 23:51:16 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
    2$ dpkg --add-architecture armhf
    3$ apt update
    4$ apt install libc6:armhf libstdc++6:armhf gcc-arm-linux-gnueabihf
    5$ arm-linux-gnueabihf-gcc test.c -o test
    6$ file test
    7test: ELF 32-bit LSB pie executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, BuildID[sha1]=285d66ef6592b9a8d2db6c174103d3d70b245ba8, for GNU/Linux 3.2.0, not stripped
    8$ ./test
    9hello world
    

    Just wanted to add that even though we discovered the issue on mobile devices (Samsung S25 and OnePlus 13), it looks like the same chip will be shipped in several laptop brands/models as well: https://www.google.com/search?q=laptop+snapdragon+x+elite

    Not exactly the same chip (X Elite versus 8 Elite). But t’s possible that they have the same problem. Would be good if someone with one of those laptops would test it. @achow101 suggested doing so in WSL might be easiest as they’re nominally Windows devices, and we don’t build Bitcoin Core for ARM64 Windows.

    To be clear my point is not that this isn’t worth fixing. The question is where.

    I think it’s ridiculous to have to implement a fix for a CPU issue in all end-user software. But it’s not totally unheard of. These kind of model-specific workarounds happen a lot in gamedev GPU programming, historically a wasteland of out-of-spec hardware and drivers, but it’s somewhat worrying to see it crossing to CPUs which you’d expect to have a higher standard.

    As said, ideally a fix would land in the kernel, as this would automatically fix userland. Linux support for these devices is still in the early stage, so it’s not unimaginable.

    But testing and ACKing the original PR says enough in that i’m also not against to carrying the change, if necessary. It’s not a big complication.

  24. maflcko commented at 10:32 am on February 20, 2025: member

    I don’t think they have a 32-bit mode, so we’d have to drop that:

    They seem to have Aarch32 support! To verify this i spun up a m7g (Graviton 3) instance, running Ubuntu:

    Thanks for checking. I guess we’d still have to drop the 32-bit cross compilation (to compile the code here), if we wanted to do it in CI (or add a separate task)? Though, I wonder if CI is a good way to test such changes, because CI doesn’t print the debug logs on success and part of the testing involves looking at debug logs.

    As a side note, for testing the happy-case, apart from m7g on aws, one can also use c4a on gcp, or the GHA arm runner (no idea what it is, but the docs are here: https://github.com/actions/partner-runner-images/blob/main/images/arm-ubuntu-24-image.md)

  25. laanwj commented at 10:49 am on February 20, 2025: member

    because CI doesn’t print the debug logs on success and part of the testing involves looking at debug logs.

    Right, to make it usefully tested in CI we’d need a specific test case for it, that checks the instruction set is successfully detected and used. Otherwise all you’re testing is “it doesn’t hang”. Which is, something at least.

  26. hugohn commented at 0:44 am on February 21, 2025: contributor

    To be clear my point is not that this isn’t worth fixing. The question is where.

    I think it’s ridiculous to have to implement a fix for a CPU issue in all end-user software. But it’s not totally unheard of. These kind of model-specific workarounds happen a lot in gamedev GPU programming, historically a wasteland of out-of-spec hardware and drivers, but it’s somewhat worrying to see it crossing to CPUs which you’d expect to have a higher standard.

    As said, ideally a fix would land in the kernel, as this would automatically fix userland. Linux support for these devices is still in the early stage, so it’s not unimaginable.

    But testing and ACKing the original PR says enough in that i’m also not against to carrying the change, if necessary. It’s not a big complication.

    I think one could argue that the fix should be made both in the kernel and in Core.

    The design of GetRandBytes() and GetStrongRandBytes() is about being robust and sourcing entropy from multiple sources, so that even if one or some of those sources fail, Core can still obtain sufficiently good entropy. There could be multiple reasons hardware entropy fails to return lower down in the stack.

    From Nunchuk’s perspective, we reuse Core code directly and have thousands of users on these mobile devices. We have no certainty about when or whether a kernel fix will land on these devices. If there’s no fix in Core itself, we’d have to carry our own patch in a separate fork, which would be brittle and unfortunate.

  27. hugohn commented at 1:08 am on February 21, 2025: contributor

    Some additional context: the PR that introduced RNDR/RNDRRS support for AArch64 (#26839) was merged about a year ago, and AFAIK it was only tested on Amazon Graviton 3 before merging—so it’s not like the feature itself has been extensively battle-tested across different SoCs.

    If someone feels this fix is too specific to a single chip and wants to revert it, the same reasoning could apply to the original PR itself. That might imply reverting the whole RNDR/RNDRRS feature, which doesn’t seem ideal given it works fine on other hardware and improves performance. Instead, I think it’s reasonable to fix it in Core and encourage a proper fix in the kernel, rather than remove the feature outright.

  28. sipa commented at 8:46 pm on February 21, 2025: member

    If we accept the fact that we have assembly RNDR/RNDRRS code in our project, I don’t think it’s unreasonable that this also means taking on the responsibility of dealing with weird hardware incompatibility issues it causes, in general. If we don’t want that, we can choose to not have this support at all and just rely on the kernel’s RNG.

    But having merged something which doesn’t compile on the platform it’s support to fix things for is a red flag, and makes me feel that if we want to spend effort on this again, it should also come with CI that can detect regressions with it in the future, or other reasons to be confident it actually works and keeps working.

  29. hugohn commented at 4:07 am on February 22, 2025: contributor
    I agree completely, @sipa. From a process perspective, it might be worth adopting a policy that any code changes not covered by CI must be verified on the relevant hardware—or at least a suitable emulator—before merging. In this particular case, the original changes were tested, but the subsequent refactor was not. Had it been tested, we would have caught the compile error.
  30. TheCharlatan referenced this in commit f78a01a560 on Feb 22, 2025
  31. maflcko commented at 1:04 pm on February 22, 2025: member

    I don’t think it is (trivially) possible to have a CI check for the fallback code without running the CI on the erroneous chip. However, I am not sure if it is worth it to integrate (and maintain) the erroneous chip into the CI in this repo just to check code that is added and never modified (apart from maybe minor refactorings every couple of years). I wouldn’t recommend to blindly trust CI anyway, so local testing of the feature should be done either way. (And if no one can be bothered to test locally, then there is clearly not enough interest in the feature anyway …)

    As for the other suggestion about testing non-erroneous chips: This can certainly be done, but the check could likely only detect successful compilation (maybe also that it doesn’t hang, or the the debug log is printed correctly). I am already running such a check in my own CI, so any issues will be found before a release of Bitcoin Core, but if people think that it is worth it to add to this repo as well, it can be done. However, if the CI is added because neither the pull request author, nor any reviewers can be bothered to even compile the code, then maybe the feature isn’t important enough to have a dedicated CI task for it. But no strong opinion. If someone wants to add such a CI task, it could probably be done with a slimmed-down build of just -DBUILD_UTIL_CHAINSTATE=ON and then running it with debug logging enabled (or similar).

  32. hugohn commented at 2:33 am on February 27, 2025: contributor

    if no one can be bothered to test locally

    Maybe I’m reading too much into it, but it feels like there’s a bit of a blame game going on in this thread, which doesn’t seem constructive. People did test the changes more than once. Let me share my own post-mortem, in case it helps. (If I misread the tone, I apologize.)

    There were four main actors in this incident:

    Reporter My colleague at Nunchuk, @giahuy98, investigated this issue after several users reported they couldn’t open our app on the S25 and OnePlus 13. He then filed #31817.

    Developer Within one day of our report, @eval-exec submitted a fix. Kudos to Eval for that quick turnaround. (Note: we don’t know Eval personally.)

    Reviewers @sipa and @laanwj reviewed and approved the fix quickly. @laanwj also asked @giahuy98 to test on a real device, which he did within hours (and confirmed it worked). Kudos to both for the prompt reviews. Later, others requested stylistic changes, and @laanwj confirmed the fix still worked on Amazon Graviton 3. @achow101 then pointed out duplicated code (thanks @achow101) and requested a refactor (discussion link). @eval-exec did the refactor, but unfortunately it introduced a compile error that went unnoticed. Afterward, @sipa and @achow101 utACK-ed the change, and then it was merged.

    Maintainers Ultimately, @achow101 merged the PR containing the untested refactor.

    From my perspective, all of us could have done better:

    Reporter: We could have tracked the changes more closely and tested each new commit, even for stylistic changes and without being explicitly asked. Developer: A quick local compile test might have caught the error introduced during refactoring. Reviewers: Could have emphasized that the code change was untested and requested verification before approving. Maintainers: Might have held off merging until confirming the final commit had actually been tested.

    For our part, since we haven’t interacted with this repo in a long time, I apologize if we didn’t fully follow the current process. Next time, we’ll be more attentive when filing issues or submitting PRs.

    Now, regarding whether to revert the RNDR/RNDRRS feature or keep it:

    • If we keep it, as @sipa mentioned, we accept that we’ll need to address hardware-specific issues as they arise.
    • If we don’t keep it, we can rely on other RNG sources entirely.

    Any other approach seems logically inconsistent. Just my two satoshis. Thanks everyone for the discussion.

  33. maflcko commented at 1:09 pm on February 27, 2025: member

    I tested the 28.0 release and it looks like the code isn’t cross-compiled into the release binary. I guess defining HWCAP2_RNG requires a specific cross-compile target?

    So the only use case for now would be self-compilation in a specific cloud pod in AWS or GCE (or possibly on some specific Laptops where the user installed Linux over Windows and did self-compilation)?

    If so, maybe c8a883a4123ab466f9d911ee2865625ba5314c9e could be temporarily reverted and the bug be reported upstream to Linux/etc. Once it is fixed and the fix is propagated enough, the feature could be re-added (possibly even without the workaround for the broken chip)?


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-04-03 09:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me