ci: Use qemu-user through container engine #28087

pull maflcko wants to merge 1 commits into bitcoin:master from maflcko:2307-ci-qemu- changing 9 files +18 −64
  1. maflcko commented at 10:15 am on July 17, 2023: member

    Currently the CI containers always run on the host architecture, and only wrap bitcoind into qemu-user when needed. This has many issues:

    • The i386 tasks can not be run on non-x86 hosts.
    • config.guess isn’t present when building the CI image, which is fine. But it prints a warning, see #27739#pullrequestreview-1446580353
    • The python tests are run on the host architecture, making it harder to find architecture specific bugs. See for example #27529 (comment)
    • All modern container engines support automatic dispatch to qemu-user, so it seems redundant to re-invent the wheel.

    Fix all issues by:

    • removing HOST from ci/test/00_setup_env.sh.
    • removing QEMU_USER_CMD and ci/test/wrap-qemu.sh.
    • removing DPKG_ADD_ARCH where possible, and pruning PACKAGES where possible.
    • specifying the architecture in CI_IMAGE_NAME_TAG to be used by the container engine.
  2. DrahtBot commented at 10:15 am on July 17, 2023: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK fanquake
    Concept ACK willcl-ark

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #28210 (build: Bump minimum supported Clang to clang-13 by MarcoFalke)
    • #28185 (ci: Use hard-coded root path for CI containers (bugfix) by MarcoFalke)
    • #27976 (ci: Start with clean env by MarcoFalke)
    • #21652 (ci: Switch more tasks to self-hosted by MarcoFalke)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot added the label Tests on Jul 17, 2023
  4. maflcko force-pushed on Jul 17, 2023
  5. maflcko commented at 11:42 am on July 17, 2023: member
    If it doesn’t work locally, you can check the enabled architectures via ls /proc/sys/fs/binfmt_misc/
  6. maflcko force-pushed on Jul 17, 2023
  7. maflcko commented at 3:28 pm on July 17, 2023: member
    Looks like it is not set up on Cirrus CI: https://cirrus-ci.com/task/6369879847075840?logs=build#L42 ?
  8. maflcko commented at 3:29 pm on July 17, 2023: member
    I may try using a full CI VM next.
  9. maflcko marked this as a draft on Jul 19, 2023
  10. maflcko marked this as ready for review on Jul 24, 2023
  11. maflcko force-pushed on Jul 24, 2023
  12. maflcko commented at 8:08 am on July 24, 2023: member
    The arm task is now run by @DrahtBot, which should make it green
  13. DrahtBot added the label CI failed on Jul 24, 2023
  14. maflcko force-pushed on Jul 25, 2023
  15. maflcko commented at 8:48 am on July 25, 2023: member
    Looks like calling the armhf (cross) compiler (which happens often) via qemu-aarch64 on amd64 is too much overhead and times out? I’ll try to run on aarch64 metal directly, which should avoid the extra hop through qemu in the CI.
  16. maflcko force-pushed on Jul 25, 2023
  17. maflcko force-pushed on Jul 25, 2023
  18. maflcko force-pushed on Jul 25, 2023
  19. DrahtBot removed the label CI failed on Jul 25, 2023
  20. maflcko commented at 3:41 pm on July 25, 2023: member

    Looks like CI is now green :)

    Also the test failure from #27529 (comment) can now be tested on this pull with just:

    0MAKEJOBS="-j$(nproc)" FILE_ENV="./ci/test/00_setup_env_s390x.sh" ./ci/test_run_all.sh
    

    Output:

     0 test  2023-07-25T15:37:42.430000Z TestFramework (ERROR): Unexpected exception caught during testing 
     1                                   Traceback (most recent call last):
     2                                     File "/root/b-c-ci/ci/scratch/build/bitcoin-s390x-ibm-linux-gnu/test/functional/test_framework/test_framework.py", line 131, in main
     3                                       self.run_test()
     4                                     File "/root/b-c-ci/ci/scratch/build/bitcoin-s390x-ibm-linux-gnu/test/functional/feature_addrman.py", line 68, in run_test
     5                                       self.start_node(0, extra_args=["-checkaddrman=1"])
     6                                     File "/root/b-c-ci/ci/scratch/build/bitcoin-s390x-ibm-linux-gnu/test/functional/test_framework/test_framework.py", line 537, in start_node
     7                                       node.wait_for_rpc_connection()
     8                                     File "/root/b-c-ci/ci/scratch/build/bitcoin-s390x-ibm-linux-gnu/test/functional/test_framework/test_node.py", line 235, in wait_for_rpc_connection
     9                                       raise FailedToStartError(self._node_msg(
    10                                   test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization
    11 test  2023-07-25T15:37:42.466000Z TestFramework (DEBUG): Closing down network thread 
    12 test  2023-07-25T15:37:42.518000Z TestFramework (INFO): Stopping nodes 
    13 test  2023-07-25T15:37:42.519000Z TestFramework.node0 (DEBUG): Stopping node 
    14
    15 node0 stderr Error: Invalid or corrupt peers.dat (AutoFile::read: end of file: iostream error). If you believe this is a bug, please report it to https://github.com/bitcoin/bitcoin/issues. As a workaround, you can move the file ("/root/b-c-ci/ci/scratch/test_runner/test_runner_₿_🏃_20230725_152500/feature_addrman_40/node0/regtest/peers.dat") out of the way (rename, move, or delete) to have a new one created on the next start. 
    
  21. maflcko force-pushed on Jul 25, 2023
  22. fanquake requested review from theStack on Jul 26, 2023
  23. fanquake requested review from willcl-ark on Jul 26, 2023
  24. willcl-ark commented at 8:28 am on July 27, 2023: member

    I don’t think I can give this much competent review I’m afraid. Partly because I can’t get it working locally, and partly because I don’t feel like I understand the changes fully yet.

    I did try to run MAKEJOBS="-j$(nproc)" FILE_ENV="./ci/test/00_setup_env_s390x.sh" ./ci/test_run_all.sh, but it takes more than 2 hours to run (amd64 local arch), and I keep getting (possibly) unrelated errors like this: 390x_build_error.txt

    I do have some questions for my own understanding though…

    Am I right in understanding that in the new setup, docker is using qemu and binfmt-misc from the container engines’ kernel to run (emulate) e.g. the "docker.io/s390x/debian:bookworm" image, and that this is roughly the same (but simpler to configure from our side) than running an x86 image, manually installing qemu, and then wrapping our various binaries in it. But it also achieves “The python tests are run on the host architecture, making it harder to find architecture specific bugs”?

  25. maflcko commented at 8:55 am on July 27, 2023: member

    but it takes more than 2 hours to run

    Yes, this is expected. Instead of just running bitcoind through qemu, now the whole container is run through qemu.

    unrelated errors

    Jup, this should be unrelated, I presume it also happens on current master. You may be able to fix it by upgrading your qemu-s390x --version. IIRC qemu 7.2 or later is required.

    this is roughly the same (but simpler to configure from our side) than running an x86 image, manually installing qemu, and then wrapping our various binaries in it. But it also achieves “The python tests are run on the host architecture, making it harder to find architecture specific bugs”?

    Yeah, it requires a minimal setup procedure on your container engine and it may run slower at runtime. However, the result is otherwise,

    • simpler, less code;
    • more flexible, supporting a guest and host of any architecture combination; (:rocket:)
    • more efficient in the happy case by default, because the container engine will detect if the guest and host architecture matches, and sidestep the expensive qemu.
    • more complete, detecting more bugs, as all executables in the container are run in qemu. (:rocket:)

    (I’ve added rocket emojis to the main features/bugfixes I am interested in)

  26. willcl-ark commented at 9:00 am on July 27, 2023: member

    Thanks Marco, thats a very helpful explanation (and I nice changes for us!).

    I will give it another go this afternoon after trying to update qemu.

  27. maflcko force-pushed on Jul 28, 2023
  28. willcl-ark commented at 9:43 am on July 31, 2023: member

    @MarcoFalke

    Yes, this is expected. Instead of just running bitcoind through qemu, now the whole container is run through qemu.

    Ok running this again… to setup and build Depends still takes more than an hour for me (with it taking 56 mins to progress past building qt) which seems relatively annoying. Am I correct that this build will use a depends cache on actual CI though, which should speed it up there significantly?

  29. maflcko commented at 9:58 am on July 31, 2023: member

    Depends still takes more than an hour for me

    Yes, there seems to be quite an overhead in calling qemu for each configure check, and each build file. This is really the cost that is paid in this pull, but I think it is worth it the benefits.

    Am I correct that this build will use a depends cache on actual CI though, which should speed it up there significantly?

    CI will run on aarch64 metal, so qemu will never be called. (depends is also cached, but this is happening regardless).

  30. willcl-ark commented at 10:55 am on July 31, 2023: member

    Unfortunately that build broke again for me for another seemingly unrealted reason. I have the correct version of qemu this time for sure:

    0will@ubuntu in ~
    1$ qemu-s390x --version
    2qemu-s390x version 7.2.0 (Debian 1:7.2+dfsg-5ubuntu2.2)
    3Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
    

    I don’t think I can test this on my side, but I am a concept ACK on these changes, and they look correct to me.

  31. maflcko commented at 11:00 am on July 31, 2023: member
    Feel free to create an issue with the unrelated stuff you ran into. Happy to take a look. Personally, I spin up a fresh VM (in a cloud or locally) to do testing, so I likely miss most CI UX issues.
  32. maflcko requested review from hebasto on Jul 31, 2023
  33. maflcko force-pushed on Aug 4, 2023
  34. maflcko commented at 8:37 am on August 4, 2023: member
    Rebased, and added one more line of documentation
  35. maflcko force-pushed on Aug 4, 2023
  36. ci: Use qemu-user through container engine fad0b67c21
  37. DrahtBot added the label Needs rebase on Aug 7, 2023
  38. maflcko force-pushed on Aug 7, 2023
  39. maflcko commented at 3:45 pm on August 7, 2023: member
    rebased :smiling_face_with_tear:
  40. DrahtBot removed the label Needs rebase on Aug 7, 2023
  41. fanquake approved
  42. fanquake commented at 10:08 am on August 9, 2023: member

    ACK fad0b67c212dcb8a16fcbda5a74acc959ed4e284 - this seems ok to me, and removes complexity from our CI system.

    I’ve locally tested the ENVs changed here, and seen the associated slowdown / resource consumption. Had GCC segfault more than once when cross-compiling s390x.

  43. fanquake merged this on Aug 9, 2023
  44. fanquake closed this on Aug 9, 2023

  45. maflcko deleted the branch on Aug 9, 2023
  46. maflcko commented at 10:30 am on August 9, 2023: member

    Had GCC segfault more than once when cross-compiling s390x.

    I think this can be fixed by setting MAKEJOBS="-j1" FILE_ENV=... (or a higher value, depending on your memory size)

  47. sidhujag referenced this in commit c019823514 on Aug 9, 2023
  48. maflcko commented at 5:40 pm on August 17, 2023: member

    Had GCC segfault more than once when cross-compiling s390x.

    I also ran into this on aarch64, and I presume it is just the CPU overheating (or some unrelated bug).

  49. maflcko commented at 3:17 pm on August 27, 2023: member

    Had GCC segfault more than once when cross-compiling s390x.

    I also ran into this on aarch64, and I presume it is just the CPU overheating (or some unrelated bug).

    It seems it only happens with qemu-s390x on aarch64, and not with e.g. qemu-riscv64 on the same aarch64. So it seems a bug with qemu-s390x.

  50. bitcoin locked this on Dec 3, 2024

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-22 06:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me