Revert merge of PR #31826 #31908

pull darosior wants to merge 1 commits into bitcoin:master from darosior:2502_revert_pr_31817 changing 1 files +11 −34
  1. darosior commented at 3:33 pm on February 19, 2025: member

    PR #31826 was merged despite the code not compiling.

    #31902 was opened to fix the code but since this code is only targeting a not officially supported platform, we don’t have a CI in place to compile and run tests on this platform, neither apparently reviewers do (nor does the author?), don’t take more risk right before 29 and revert the original broken PR.

  2. Revert "Merge bitcoin/bitcoin#31826: random: Check `GetRNDRRS` is supported in `InitHardwareRand` to avoid infinite loop"
    This reverts commit 139640079ff52de5a7935e98225da76686ef32cb, reversing
    changes made to dc3a71463373e07e1c8956173c039e61fecbf029.
    3e9b12b3e0
  3. DrahtBot commented at 3:33 pm on February 19, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31908.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK sipa, TheCharlatan, eval-exec, laanwj, achow101

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  4. glozow added this to the milestone 29.0 on Feb 19, 2025
  5. glozow commented at 3:51 pm on February 19, 2025: member
    Up to reviewers to decide whether this or #31902 is more appropriate, but added to milestone so we do this before branch off.
  6. sipa commented at 4:01 pm on February 19, 2025: member

    ACK 3e9b12b3e0f039a8760410afed74c7e4d15afbe6

    We shouldn’t have merged #31826 if it didn’t compile, and apparently nobody actually tested it. It’s one thing to make changes for an unsupported platform that don’t hurt others, but clearly the bar for testing was too low if it actually doesn’t compile at all.

  7. dergoegge commented at 4:05 pm on February 19, 2025: member

    neither apparently reviewers do (nor does the author?)

    It was actually tested on the merged PR: see #31826 (comment) and #31826 (comment) but a later refactoring suggestion (https://github.com/bitcoin/bitcoin/pull/31826#discussion_r1955192539) introduced the compilation issue and no re-testing was done before merge.

    I think if @laanwj and @giahuy98 are willing to re-test on #31902 we could consider merging the fix instead.

  8. TheCharlatan approved
  9. TheCharlatan commented at 4:06 pm on February 19, 2025: contributor
    ACK 3e9b12b3e0f039a8760410afed74c7e4d15afbe6
  10. eval-exec commented at 4:09 pm on February 19, 2025: contributor
  11. glozow requested review from laanwj on Feb 19, 2025
  12. laanwj commented at 4:51 pm on February 19, 2025: member

    We shouldn’t have merged #31826 if it didn’t compile, and apparently nobody actually tested it. It’s one thing to make changes for an unsupported platform that don’t hurt others, but clearly the bar for testing was too low if it actually doesn’t compile at all.

    i did actually test it extensively, but yes it was changed and i hadn’t gotten around to reviewing or re-testing it yet.

  13. laanwj approved
  14. laanwj commented at 4:57 pm on February 19, 2025: member
    ACK 3e9b12b3e0f039a8760410afed74c7e4d15afbe6 i have checked that this is an exact revert of 09b150bb8adae00854f02ece69fc6ef222fb07d9
  15. achow101 commented at 5:20 pm on February 19, 2025: member

    since this code is only targeting a not officially supported platform

    Ostensibly we do though, as the code is part of the aarch64 platform which we do have CI for. However, it appears that our ARM CI does not have HWCAP2_RNG, but it could if we switched to using AWS Graviton instances.

  16. laanwj commented at 5:30 pm on February 19, 2025: member

    Ostensibly we do though, as the code is part of the aarch64 platform which we do have CI for. However, it appears that our ARM CI does not have HWCAP2_RNG, but it could if we switched to using AWS Graviton instances.

    Having the ARM64 RNG code in the first place makes sense, and we could test this on Graviton instances.

    But also agree that a workaround for an errata of specific SoC series is really on the edge of what is in scope for this project. Or user code in the first place, for that matter. Ideally a broken implementation of a feature is fixed in the kernel. Either by disabling the functionality on that specific SoC or another workaround like a microcode update or a hook for the instruction.

  17. achow101 commented at 6:01 pm on February 19, 2025: member
    ACK 3e9b12b3e0f039a8760410afed74c7e4d15afbe6
  18. achow101 merged this on Feb 19, 2025
  19. achow101 closed this on Feb 19, 2025

  20. jonatack commented at 6:11 pm on February 19, 2025: member

    (Why my approve button can’t click?

    I believe it was disabled for non-members of the repository because we were seeing too much spamming from that functionality.

  21. maflcko commented at 10:21 pm on February 19, 2025: member

    Ostensibly we do though, as the code is part of the aarch64 platform which we do have CI for. However, it appears that our ARM CI does not have HWCAP2_RNG, but it could if we switched to using AWS Graviton instances.

    Having the ARM64 RNG code in the first place makes sense, and we could test this on Graviton instances.

    I don’t think they have a 32-bit mode, so we’d have to drop that:

    https://github.com/bitcoin/bitcoin/blob/e606c577cb257d0927ca6ea939aab06cf0639aad/ci/test/00_setup_env_arm.sh#L9

    So if something is added, it would be a new task?

    Moreover, there are also GHA runners with HWCAP2_RNG set, but they are in beta and sometimes fail to start.

  22. hugohn commented at 8:28 am on February 20, 2025: contributor

    Just wanted to add that even though we discovered the issue on mobile devices (Samsung S25 and OnePlus 13), it looks like the same chip will be shipped in several laptop brands/models as well: https://www.google.com/search?q=laptop+snapdragon+x+elite

    My colleague @giahuy98 did test the original fix (https://github.com/bitcoin/bitcoin/pull/31826) on a real S25 device (it works), and @laanwj tested as well. Unfortunately, after that, there were further code changes (code refactoring) that did not get tested.

  23. laanwj commented at 9:42 am on February 20, 2025: member

    I don’t think they have a 32-bit mode, so we’d have to drop that:

    They seem to have Aarch32 support! To verify this i spun up a m7g (Graviton 3) instance, running Ubuntu:

    0$ uname -a
    1Linux ip-172-31-13-55 6.8.0-1021-aws [#23](/bitcoin-bitcoin/23/)-Ubuntu SMP Mon Dec  9 23:51:16 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
    2$ dpkg --add-architecture armhf
    3$ apt update
    4$ apt install libc6:armhf libstdc++6:armhf gcc-arm-linux-gnueabihf
    5$ arm-linux-gnueabihf-gcc test.c -o test
    6$ file test
    7test: ELF 32-bit LSB pie executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, BuildID[sha1]=285d66ef6592b9a8d2db6c174103d3d70b245ba8, for GNU/Linux 3.2.0, not stripped
    8$ ./test
    9hello world
    

    Just wanted to add that even though we discovered the issue on mobile devices (Samsung S25 and OnePlus 13), it looks like the same chip will be shipped in several laptop brands/models as well: https://www.google.com/search?q=laptop+snapdragon+x+elite

    Not exactly the same chip (X Elite versus 8 Elite). But t’s possible that they have the same problem. Would be good if someone with one of those laptops would test it. @achow101 suggested doing so in WSL might be easiest as they’re nominally Windows devices, and we don’t build Bitcoin Core for ARM64 Windows.

    To be clear my point is not that this isn’t worth fixing. The question is where.

    I think it’s ridiculous to have to implement a fix for a CPU issue in all end-user software. But it’s not totally unheard of. These kind of model-specific workarounds happen a lot in gamedev GPU programming, historically a wasteland of out-of-spec hardware and drivers, but it’s somewhat worrying to see it crossing to CPUs which you’d expect to have a higher standard.

    As said, ideally a fix would land in the kernel, as this would automatically fix userland. Linux support for these devices is still in the early stage, so it’s not unimaginable.

    But testing and ACKing the original PR says enough in that i’m also not against to carrying the change, if necessary. It’s not a big complication.

  24. maflcko commented at 10:32 am on February 20, 2025: member

    I don’t think they have a 32-bit mode, so we’d have to drop that:

    They seem to have Aarch32 support! To verify this i spun up a m7g (Graviton 3) instance, running Ubuntu:

    Thanks for checking. I guess we’d still have to drop the 32-bit cross compilation (to compile the code here), if we wanted to do it in CI (or add a separate task)? Though, I wonder if CI is a good way to test such changes, because CI doesn’t print the debug logs on success and part of the testing involves looking at debug logs.

    As a side note, for testing the happy-case, apart from m7g on aws, one can also use c4a on gcp, or the GHA arm runner (no idea what it is, but the docs are here: https://github.com/actions/partner-runner-images/blob/main/images/arm-ubuntu-24-image.md)

  25. laanwj commented at 10:49 am on February 20, 2025: member

    because CI doesn’t print the debug logs on success and part of the testing involves looking at debug logs.

    Right, to make it usefully tested in CI we’d need a specific test case for it, that checks the instruction set is successfully detected and used. Otherwise all you’re testing is “it doesn’t hang”. Which is, something at least.

  26. hugohn commented at 0:44 am on February 21, 2025: contributor

    To be clear my point is not that this isn’t worth fixing. The question is where.

    I think it’s ridiculous to have to implement a fix for a CPU issue in all end-user software. But it’s not totally unheard of. These kind of model-specific workarounds happen a lot in gamedev GPU programming, historically a wasteland of out-of-spec hardware and drivers, but it’s somewhat worrying to see it crossing to CPUs which you’d expect to have a higher standard.

    As said, ideally a fix would land in the kernel, as this would automatically fix userland. Linux support for these devices is still in the early stage, so it’s not unimaginable.

    But testing and ACKing the original PR says enough in that i’m also not against to carrying the change, if necessary. It’s not a big complication.

    I think one could argue that the fix should be made both in the kernel and in Core.

    The design of GetRandBytes() and GetStrongRandBytes() is about being robust and sourcing entropy from multiple sources, so that even if one or some of those sources fail, Core can still obtain sufficiently good entropy. There could be multiple reasons hardware entropy fails to return lower down in the stack.

    From Nunchuk’s perspective, we reuse Core code directly and have thousands of users on these mobile devices. We have no certainty about when or whether a kernel fix will land on these devices. If there’s no fix in Core itself, we’d have to carry our own patch in a separate fork, which would be brittle and unfortunate.

  27. hugohn commented at 1:08 am on February 21, 2025: contributor

    Some additional context: the PR that introduced RNDR/RNDRRS support for AArch64 (#26839) was merged about a year ago, and AFAIK it was only tested on Amazon Graviton 3 before merging—so it’s not like the feature itself has been extensively battle-tested across different SoCs.

    If someone feels this fix is too specific to a single chip and wants to revert it, the same reasoning could apply to the original PR itself. That might imply reverting the whole RNDR/RNDRRS feature, which doesn’t seem ideal given it works fine on other hardware and improves performance. Instead, I think it’s reasonable to fix it in Core and encourage a proper fix in the kernel, rather than remove the feature outright.

  28. sipa commented at 8:46 pm on February 21, 2025: member

    If we accept the fact that we have assembly RNDR/RNDRRS code in our project, I don’t think it’s unreasonable that this also means taking on the responsibility of dealing with weird hardware incompatibility issues it causes, in general. If we don’t want that, we can choose to not have this support at all and just rely on the kernel’s RNG.

    But having merged something which doesn’t compile on the platform it’s support to fix things for is a red flag, and makes me feel that if we want to spend effort on this again, it should also come with CI that can detect regressions with it in the future, or other reasons to be confident it actually works and keeps working.

  29. hugohn commented at 4:07 am on February 22, 2025: contributor
    I agree completely, @sipa. From a process perspective, it might be worth adopting a policy that any code changes not covered by CI must be verified on the relevant hardware—or at least a suitable emulator—before merging. In this particular case, the original changes were tested, but the subsequent refactor was not. Had it been tested, we would have caught the compile error.

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-02-22 06:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me