Gitian builds fail non-deterministically #17323

issue MarcoFalke opened this issue on October 30, 2019
  1. MarcoFalke commented at 5:39 PM on October 30, 2019: member

    arch: amd64
    gitian virtualization: docker and podman
    host OS: Fedora and Ubuntu Bionic
    gitian builder at commit: https://github.com/devrandom/gitian-builder/commit/9b28e9c990eb0e98fb23573903791c75fdee3db1

    Log: win-build.log

  2. MarcoFalke added the label Bug on Oct 30, 2019
  3. MarcoFalke added the label Build system on Oct 30, 2019
  4. MarcoFalke added the label Needs gitian build on Oct 30, 2019
  5. MarcoFalke commented at 5:44 PM on October 30, 2019: member

    The Ubuntu Bionic machine uses 2 threads to compile and the Fedora machine uses 9 threads.

    First seen here: #16667 (comment)

  6. laanwj commented at 5:57 PM on October 30, 2019: member

    Apparently it's missing the univalue symbols during link. It did however build univalue, and create libunivalue.la.

  7. MarcoFalke commented at 7:30 PM on October 30, 2019: member

    I will try using one thread to see if the issue is related to that

    Times passed with one thread: 3

  8. MarcoFalke commented at 2:37 AM on October 31, 2019: member

    Ok, I built with one thread and it failed with a different linker error this time: win-build.log

  9. MarcoFalke commented at 2:39 AM on October 31, 2019: member

    Will try to build rc3 tomorrow and see if the hashes match

  10. MarcoFalke renamed this:
    Windows gitian builds fail non-deterministically
    Gitian builds fail non-deterministically
    on Nov 1, 2019
  11. MarcoFalke commented at 12:48 AM on November 1, 2019: member

    On master, linux and windows are affected now

  12. laanwj commented at 9:25 AM on November 1, 2019: member

    I'll try a bit too.

    Ok, I built with one thread and it failed with a different linker error this time:

    Same kind of error though (unreferenced symbols for a library that seems to have been built). So it affects secp256k1 too.

    An apparent data race is definitely unexpected with only one thread.

  13. MarcoFalke commented at 3:51 PM on November 1, 2019: member

    Same error for 7358ae6d71cd0e5908a1203b61cd4e54fe4af5de (rc3), 9 threads, so it is not related to #16667

    linux-build.log

  14. laanwj commented at 3:57 PM on November 1, 2019: member

    Haven't been able to reproduce it yet (debian, LXC, 6 threads, building master for linux/windows/macosx in a loop).

    linux-build.log

    Even standard std::__cxx11::basic_string symbols are missing there. Seems like the linker is botched.

    There's also: /usr/bin/ld: i386:x86-64 architecture of input file `univalue/.libs/libunivalue.a(libunivalue_la-univalue_get.o)' is incompatible with i386 output

    Could be that there's sometimes files left behind that interfere with the build?

  15. MarcoFalke commented at 5:31 PM on November 1, 2019: member

    I might just switch to guix builds. @dongcarl wen guix, sir?

  16. laanwj commented at 11:10 AM on November 2, 2019: member

    FWIW I've been running head-to-tail gitian builds of master for all OSes for the entire night, and haven't had a single failure.

    Can you try to bisect this?

    I might just switch to guix builds.

    Yes, would be interesting to see if that solves it. Whether it works or not, it will help isolate the issue.

  17. dongcarl assigned dongcarl on Nov 4, 2019
  18. dongcarl commented at 4:32 PM on November 4, 2019: member

    @MarcoFalke What's the best commit and os for me to reproduce this? Going to try running Guix and seeing if we get the same problem.

  19. MarcoFalke commented at 4:41 PM on November 4, 2019: member

    I will produce signed results for rc3 the old way (gbuild) to rule out that this is caused by one of my wrapper scripts, which wrap the ./contrib/gitian-build.py wrapper script, which in turn wraps gbuild.

  20. MarcoFalke commented at 7:15 PM on November 4, 2019: member

    DrahtBot ran with:
    arch: amd64
    OS: Ubuntu Bionic
    docker: vanilla ubuntu package

    I ran with:
    arch: amd64
    os: fedora 30
    podman: 1.4.4-4 (https://koji.fedoraproject.org/koji/buildinfo?buildID=1314654)
    gitian: 9b28e9c990eb0e98fb23573903791c75fdee3db1
    VERSION=0.19.0rc3

    $ ./bin/gbuild --num-make 9 --memory 9000 --commit bitcoin=v${VERSION} ../bitcoin/contrib/gitian-descriptors/gitian-win.yml
    

    and it fails as follows:

    --- Building for bionic amd64 ---
    Stopping target if it is up
    =
    podman container stop gitian-target
    =
    =
    podman container rm gitian-target
    =
    Making a new image copy
    Starting target
    Checking if target is up=
    podman run -d --name gitian-target base-bionic-amd64:latest
    =
    .
    =
    podman exec -u ubuntu -i gitian-target true
    =
    Preparing build environment
    =
    podman exec -u ubuntu -i gitian-target setarch x86_64 bash
    =
    =
    podman exec -u ubuntu gitian-target mkdir -p /home/ubuntu/cache/
    =
    =
    podman cp cache/bitcoin-core-win-0.19/ gitian-target:/home/ubuntu/cache/
    =
    =
    podman exec -u root gitian-target chown -R ubuntu:ubuntu /home/ubuntu/cache/
    =
    =
    podman exec -u ubuntu gitian-target mkdir -p /home/ubuntu/cache/
    =
    =
    podman cp cache/common/ gitian-target:/home/ubuntu/cache/
    =
    =
    podman exec -u root gitian-target chown -R ubuntu:ubuntu /home/ubuntu/cache/
    =
    Updating apt-get repository (log in var/install.log)
    Installing additional packages (log in var/install.log)
    =
    podman exec -u root -i gitian-target [ ! -e /var/cache/gitian/initial-upgrade ]
    =
    Upgrading system, may take a while (log in var/install.log)
    Creating package manifest
    =
    podman exec -u root -i gitian-target bash
    =
    Creating build script (var/build-script)
    =
    podman exec -u ubuntu gitian-target mkdir -p /home/ubuntu/build/
    =
    =
    podman cp inputs/bitcoin gitian-target:/home/ubuntu/build/
    =
    =
    podman exec -u root gitian-target chown -R ubuntu:ubuntu /home/ubuntu/build/
    =
    Running build script (log in var/build.log)
    Traceback (most recent call last):
    	6: from ./bin/gbuild:307:in `<main>'
    	5: from ./bin/gbuild:307:in `each'
    	4: from ./bin/gbuild:309:in `block in <main>'
    	3: from ./bin/gbuild:309:in `each'
    	2: from ./bin/gbuild:314:in `block (2 levels) in <main>'
    	1: from ./bin/gbuild:164:in `build_one_configuration'
    ./bin/gbuild:21:in `system!': failed to run on-target setarch x86_64 bash -x < var/build-script > var/build.log 2>&1 (RuntimeError)
    

    build.log

  21. MarcoFalke commented at 7:16 PM on November 4, 2019: member

    Can you try to bisect this?

    This will be hard because it is non-deterministic

  22. MarcoFalke commented at 7:17 PM on November 4, 2019: member

    (The above failure with vanilla gbuild took 4 tries)

  23. laanwj commented at 8:00 AM on November 5, 2019: member

    Looks like you've tried different machines, different host OS, built for different architectures, tried different numbers of threads. The only commonality seems to be the use of gitian with docker.

    Alternatively, a change in Bionic (the guest OS) triggered this. I'll regenerate the base image and see if it starts happening.

    I wonder, how to debug a non-deterministic linker failure? Is it possible to inspect the state of a VM after it happens? Maybe comparing some of the .o's and .a's with those produced in a successful run could explain a thing.

  24. MarcoFalke commented at 12:11 PM on November 5, 2019: member

    Bisecting this (eta is next week):

    BAD: 0.19.0rc3 BAD: 085cac6b90 BAD: 81f732bcaa BAD: 1a8a5ede9f GOOD (?): 0.18.1

    Edit: I used the gitian descriptor from 0.19.0rc3, so here the bisect goes again with the gitian descriptors from the respective tag:

    BAD: 0.19.0rc3 RUNNING: 1a8a5ede9f
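
Bisecting a failure that only shows up on some runs usually means building each candidate commit several times and marking it bad on the first failure. A minimal sketch of such a helper, assuming a hypothetical `./build-once.sh` wrapper around the gbuild invocation; the name `all_runs_pass` is illustrative and not part of gitian:

```shell
# Returns 0 if the given command succeeds `tries` times in a row,
# and 1 as soon as one run fails. Build output is discarded here;
# redirect it to a log file instead if it is needed for debugging.
all_runs_pass() (
    tries=$1
    shift
    i=0
    while [ "$i" -lt "$tries" ]; do
        "$@" >/dev/null 2>&1 || exit 1
        i=$((i + 1))
    done
    exit 0
)

# Hypothetical usage with git bisect (helpers.sh and build-once.sh are assumed names):
#   git bisect run sh -c '. ./helpers.sh && all_runs_pass 5 ./build-once.sh'
```

`git bisect run` treats exit code 0 as good and 1 as bad, so the helper can be used directly. A "good" verdict is only as trustworthy as the number of tries: the vanilla gbuild failure above needed 4 attempts to show up.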

  25. MarcoFalke commented at 12:13 PM on November 5, 2019: member

    I wonder, how to debug a non-deterministic linker failure? Is it possible to inspect the state of a VM after it happens? Maybe comparing some of the .o's and .a's with those produced in a successful run could explain a thing.

    I believe the container should still be running. I will try to upload a full dump next time I see the issue.
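
One way to compare the .o and .a files of a failed run with those of a successful one is a byte-for-byte walk over both trees. This is only a sketch: `diff_objects` is a made-up name, both directories are assumed to be copies of the same build directory taken from a good and a bad container, and it assumes intermediate objects are themselves reproducible between runs, which may not hold for every file.

```shell
# Prints "missing PATH" for objects/archives that exist only in the good tree
# and "differs PATH" for those whose contents are not byte-identical.
diff_objects() (
    bad=$(cd "$2" && pwd) || exit 2   # resolve to an absolute path before cd
    cd "$1" || exit 2
    find . -type f \( -name '*.o' -o -name '*.a' \) | sort |
    while IFS= read -r rel; do
        if [ ! -f "$bad/$rel" ]; then
            printf 'missing %s\n' "$rel"
        elif ! cmp -s "$rel" "$bad/$rel"; then
            printf 'differs %s\n' "$rel"
        fi
    done
)

# Hypothetical usage:
#   diff_objects /tmp/good/build/bitcoin /tmp/bad/build/bitcoin
```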

  26. MarcoFalke commented at 2:45 PM on November 5, 2019: member
  27. MarcoFalke commented at 2:49 PM on November 5, 2019: member

    I ran make V=1 by hand, so the gcc and linker wrappers shouldn't be in its PATH, and it gives me:

    root@625be49b673e:/home/ubuntu/build/bitcoin/distsrc-x86_64-w64-mingw32# make V=1  
    Making all in src
    make[1]: Entering directory '/home/ubuntu/build/bitcoin/distsrc-x86_64-w64-mingw32/src'
    make[2]: Entering directory '/home/ubuntu/build/bitcoin/distsrc-x86_64-w64-mingw32/src'
    /bin/bash ../libtool  --tag=CXX --preserve-dup-deps  --mode=link x86_64-w64-mingw32-g++ -std=c++11  -fstack-reuse=none -Wstack-protector -fstack-protector-all      -fPIE -pipe -O2 -O2 -g -fvisibility=hidden -Wl,--exclude-libs,ALL -pthread  -Wl,--dynamicbase -Wl,--nxcompat -Wl,--high-entropy-va -pie    -all-static -L/home/ubuntu/build/bitcoin/depends/x86_64-w64-mingw32/share/../lib  -o bitcoind.exe bitcoind-bitcoind.o bitcoind-res.o libbitcoin_server.a libbitcoin_wallet.a libbitcoin_server.a libbitcoin_common.a univalue/libunivalue.la libbitcoin_util.a libbitcoin_zmq.a libbitcoin_consensus.a crypto/libbitcoin_crypto_base.a crypto/libbitcoin_crypto_sse41.a crypto/libbitcoin_crypto_avx2.a crypto/libbitcoin_crypto_shani.a leveldb/libleveldb.a leveldb/libleveldb_sse42.a leveldb/libmemenv.a secp256k1/libsecp256k1.la -L/home/ubuntu/build/bitcoin/depends/x86_64-w64-mingw32/share/../lib -lboost_system-mt-s-x64 -lboost_filesystem-mt-s-x64 -lboost_thread-mt-s-x64 -lboost_chrono-mt-s-x64 -ldb_cxx-4.8 -lcrypto -lminiupnpc  -levent -lzmq -lQt5AccessibilitySupport -lQt5DeviceDiscoverySupport -lQt5FbSupport -lQt5ThemeSupport -lQt5EventDispatcherSupport -lQt5FontDatabaseSupport -lssp -lcrypt32 -liphlpapi -lshlwapi -lmswsock -lws2_32 -ladvapi32 -lrpcrt4 -luuid -loleaut32 -lole32 -lcomctl32 -lshell32 -lwinmm -lwinspool -lcomdlg32 -lgdi32 -luser32 -lkernel32 -lmingwthrd 
    libtool: link: x86_64-w64-mingw32-g++ -std=c++11 -fstack-reuse=none -Wstack-protector -fstack-protector-all -fPIE -pipe -O2 -O2 -g -fvisibility=hidden -Wl,--exclude-libs -Wl,ALL -pthread -Wl,--dynamicbase -Wl,--nxcompat -Wl,--high-entropy-va -pie -static -o bitcoind.exe bitcoind-bitcoind.o bitcoind-res.o  -L/home/ubuntu/build/bitcoin/depends/x86_64-w64-mingw32/share/../lib libbitcoin_server.a libbitcoin_wallet.a libbitcoin_server.a libbitcoin_common.a univalue/.libs/libunivalue.a libbitcoin_util.a libbitcoin_zmq.a libbitcoin_consensus.a crypto/libbitcoin_crypto_base.a crypto/libbitcoin_crypto_sse41.a crypto/libbitcoin_crypto_avx2.a crypto/libbitcoin_crypto_shani.a leveldb/libleveldb.a leveldb/libleveldb_sse42.a leveldb/libmemenv.a secp256k1/.libs/libsecp256k1.a -lboost_system-mt-s-x64 -lboost_filesystem-mt-s-x64 -lboost_thread-mt-s-x64 -lboost_chrono-mt-s-x64 -ldb_cxx-4.8 -lcrypto -lminiupnpc -levent -lzmq -lQt5AccessibilitySupport -lQt5DeviceDiscoverySupport -lQt5FbSupport -lQt5ThemeSupport -lQt5EventDispatcherSupport -lQt5FontDatabaseSupport -lssp -lcrypt32 -liphlpapi -lshlwapi -lmswsock -lws2_32 -ladvapi32 -lrpcrt4 -luuid -loleaut32 -lole32 -lcomctl32 -lshell32 -lwinmm -lwinspool -lcomdlg32 -lgdi32 -luser32 -lkernel32 -lmingwthrd -pthread
    libbitcoin_server.a(libbitcoin_server_a-chain.o): In function `operator()':
    /home/ubuntu/build/bitcoin/distsrc-x86_64-w64-mingw32/src/interfaces/chain.cpp:219: undefined reference to `UniValue::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const'
    /home/ubuntu/build/bitcoin/distsrc-x86_64-w64-mingw32/src/interfaces/chain.cpp:220: undefined reference to `UniValue::get_int() const'
    libbitcoin_server.a(libbitcoin_server_a-init.o): In function `BlockNotifyGenesisWait':
    /home/ubuntu/build/bitcoin/distsrc-x86_64-w64-mingw32/src/init.cpp:606: undefined reference to `std::condition_variable::notify_all()'
    libbitcoin_server.a(libbitcoin_server_a-init.o): In function `OnRPCStopped':
    /home/ubuntu/build/bitcoin/distsrc-x86_64-w64-mingw32/src/init.cpp:356: undefined reference to `std::condition_variable::notify_all()'
    libbitcoin_server.a(libbitcoin_server_a-init.o):/usr/lib/gcc/x86_64-w64-mingw32/7.3-posix/include/c++/thread:126: undefined reference to `std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)())'
    libbitcoin_server.a(libbitcoin_server_a-init.o): In function `BlockNotifyCallback':
    /home/ubuntu/build/bitcoin/distsrc-x86_64-w64-mingw32/src/init.cpp:591: undefined reference to `std::thread::detach()'
    libbitcoin_server.a(libbitcoin_server_a-init.o): In function `__tcf_18':
    ...
    
  28. laanwj commented at 3:57 PM on November 5, 2019: member

    Ok, I examined the archive and checked:

    • build/bitcoin/distsrc-x86_64-w64-mingw32/src/univalue/.libs/libunivalue.a has a symbols dictionary
    • symbol UniValue::get_int() is in there (just picked an example one)
    nm -s -g --demangle  ./build/bitcoin/distsrc-x86_64-w64-mingw32/src/univalue/.libs/libunivalue.a| grep "UniValue::get_int() const"
    UniValue::get_int() const in libunivalue_la-univalue_get.o
    0000000000000630 T UniValue::get_int() const
    
    • I extracted the .o file using ar x and checked inside:
    0000000000000630 T UniValue::get_int() const
    
    • However, readelf -h shows:
    ELF Header:
      Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
      Class:                             ELF64
      Data:                              2's complement, little endian
      Version:                           1 (current)
      OS/ABI:                            UNIX - System V
      ABI Version:                       0
      Type:                              REL (Relocatable file)
      Machine:                           Advanced Micro Devices X86-64
    
    • wait, what: x86-64?!?!?

    • wait, what: ELF!

    • let's check one of bitcoin's own files

    $ readelf -h ./build/bitcoin/distsrc-x86_64-w64-mingw32/src/libbitcoin_server_a-chain.o
    readelf: Error: Not an ELF file - it has the wrong magic bytes at the start
    

    It looks like the univalue lib was compiled for x86_64 Linux, not x86_64 Windows! This explains why the linking doesn't work, at least.

    These are ELF:

    ./src/secp256k1/gen_context.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), with debug_info, not stripped
    ./src/univalue/lib/libunivalue_la-univalue_write.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), with debug_info, not stripped
    ./src/univalue/lib/libunivalue_la-univalue.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), with debug_info, not stripped
    ./src/univalue/lib/libunivalue_la-univalue_get.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), with debug_info, not stripped
    ./src/univalue/lib/libunivalue_la-univalue_read.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), with debug_info, not stripped
    

    (The secp256k1 file is correct: gen_context is for the build host, not for the target.) The rest is COFF at least:

    ./src/libbitcoin_common_a-compressor.o: Intel amd64 COFF object file, no line number info, not stripped, 20 sections, symbol offset=0x2ba8c, 64 symbols
    ./src/node/libbitcoin_server_a-psbt.o: Intel amd64 COFF object file, no line number info, not stripped, 60 sections, symbol offset=0xbb4a8, 170 symbols
    …
    

    Haven't been able to find out why yet. But it fails for univalue subtree, apparently.
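
The same check can be scripted without readelf or file by looking at the first four bytes of each object. A rough sketch, where `is_elf` and `list_elf_objects` are illustrative names; in a Windows cross-build tree any hit other than build-host tools such as gen_context.o would be suspicious:

```shell
# Succeeds if the file starts with the ELF magic bytes 7f 45 4c 46.
is_elf() {
    [ "$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')" = "7f454c46" ]
}

# Prints every .o file under the given directory that is an ELF object.
list_elf_objects() (
    find "$1" -type f -name '*.o' | while IFS= read -r obj; do
        if is_elf "$obj"; then
            printf '%s\n' "$obj"
        fi
    done
)

# Hypothetical usage:
#   list_elf_objects ./build/bitcoin/distsrc-x86_64-w64-mingw32/src
```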

  29. laanwj commented at 4:16 PM on November 5, 2019: member

    HRM, looks like univalue's configure does not use cross-compiler CC:

    [build/bitcoin/distsrc-x86_64-w64-mingw32/src/univalue/Makefile]

    CC = gcc
    CCDEPMODE = depmode=none
    ac_ct_CC = gcc
    CXX = g++
    CXXCPP = g++ -E
    CXXDEPMODE = depmode=none
    CXXFLAGS = -O2 -g
    

    looking at config.log it's not configured for cross-compiling at all:

      $ ./configure --disable-option-checking --prefix=/ --disable-ccache --disable-maintainer-mode --disable-dependency-tracking --enable-reduce-exports --disable-bench --disable-gui-tests CFLAGS=-O2 -g CXXFLAGS=-O2 -g --disable-shared --with-pic --with-bignum=no --enable-module-recovery --disable-jni --cache-file=/dev/null --srcdir=. --no-create --no-recursion
    …
    configure:2934: checking build system type
    configure:2948: result: x86_64-pc-linux-gnu
    configure:2968: checking host system type
    configure:2981: result: x86_64-pc-linux-gnu
    
    • CONFIG_SITE=/home/ubuntu/build/bitcoin/depends/x86_64-w64-mingw32/share/config.site wasn't passed through to the child configure script, so it doesn't pick up the cross-build configuration
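
A quick way to spot a subtree that was configured for the wrong platform is to pull the host type out of its config.log and compare it with the expected triplet. The sketch below assumes the config.log layout quoted above (a "checking host system type" line followed later by a "result:" line); the function names are made up for illustration:

```shell
# Prints the host system type recorded in a config.log.
host_type_from_config_log() {
    awk '
        /checking host system type/ { want = 1; next }
        want && /result:/ { sub(/^.*result: /, ""); print; exit }
    ' "$1"
}

# Succeeds only if the config.log was produced for the expected host triplet.
config_log_targets() (
    actual=$(host_type_from_config_log "$1")
    [ "$actual" = "$2" ]
)

# Hypothetical usage:
#   config_log_targets src/univalue/config.log x86_64-w64-mingw32 || echo "not cross-configured"
```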
  30. MarcoFalke commented at 4:42 PM on November 5, 2019: member

    Ok, thanks for finding the issue. I will keep bisecting then

  31. dongcarl commented at 5:08 PM on November 5, 2019: member

    Ok, I built with one thread and it failed with a different linker error this time: win-build.log

    I read the logs here, and the problem appears to come from when we run ./config.status --recheck in secp256k1.

    You'll see that on L1505, we configure distsrc-x86_64-w64-mingw32/src/secp256k1 once, and it seems to be getting the right system types...

    === configuring in src/secp256k1 (/home/ubuntu/build/bitcoin/distsrc-x86_64-w64-mingw32/src/secp256k1)
    configure: running /bin/bash ./configure --disable-option-checking '--prefix=/'  '--disable-ccache' '--disable-maintainer-mode' '--disable-dependency-tracking' '--enable-reduce-exports' '--disable-bench' '--disable-gui-tests' 'CFLAGS=-O2 -g' 'CXXFLAGS=-O2 -g' '--disable-shared' '--with-pic' '--enable-benchmark=no' '--with-bignum=no' '--enable-module-recovery' '--disable-jni' --cache-file=/dev/null --srcdir=.
    configure: loading site script /home/ubuntu/build/bitcoin/depends/x86_64-w64-mingw32/share/config.site
    checking build system type... x86_64-pc-linux-gnu
    checking host system type... x86_64-w64-mingw32
    

    but later, on L1701, we run ./config.status --recheck in secp256k1 (distsrc-x86_64-w64-mingw32/src/secp256k1) and we start getting problems:

    /bin/bash ./config.status --recheck
    running CONFIG_SHELL=/bin/bash /bin/bash ./configure --disable-option-checking --prefix=/ --disable-ccache --disable-maintainer-mode --disable-dependency-tracking --enable-reduce-exports --disable-bench --disable-gui-tests CFLAGS=-O2 -g CXXFLAGS=-O2 -g --disable-shared --with-pic --enable-benchmark=no --with-bignum=no --enable-module-recovery --disable-jni --cache-file=/dev/null --srcdir=. --no-create --no-recursion
    checking build system type... x86_64-pc-linux-gnu
    checking host system type... x86_64-pc-linux-gnu
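
To locate such spots in a long build log, it may help to list every recheck and every host type line that does not match the cross target. A small sketch assuming the log format quoted above; `find_rechecks` and `find_wrong_hosts` are illustrative names:

```shell
# Prints "LINE:TEXT" for every config.status --recheck invocation in the log.
find_rechecks() {
    grep -n -e 'config\.status --recheck' "$1"
}

# Prints "LINE:HOST" for each host type check whose result differs
# from the expected triplet given as the second argument.
find_wrong_hosts() {
    awk -v expected="$2" '
        /checking host system type\.\.\. / {
            host = $0
            sub(/^.*checking host system type\.\.\. /, "", host)
            if (host != expected) print NR ":" host
        }
    ' "$1"
}

# Hypothetical usage:
#   find_rechecks var/build.log
#   find_wrong_hosts var/build.log x86_64-w64-mingw32
```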
    
  32. laanwj commented at 6:36 PM on November 5, 2019: member

    That's a good find: --recheck is not part of any of the descriptors at least. It shouldn't be there. Could something intermittently be causing a recheck?

  33. MarcoFalke commented at 8:36 PM on November 5, 2019: member

    Ok, 0.18.1 (or rather 1a8a5ede9f) also fails. Note that the corresponding gitian descriptor is properly checked out:

    $ export COMMIT=1a8a5ede9f && (cd ../bitcoin && git checkout $COMMIT) && while ./bin/gbuild --num-make 9 --memory 9000 --commit bitcoin=$COMMIT ../bitcoin/contrib/gitian-descriptors/gitian-win.yml; do bash -c "echo \"one more success for $COMMIT\" >> /tmp/g_t"; done
    

    install.log build.log

  34. MarcoFalke commented at 8:43 PM on November 5, 2019: member

    $ sha256sum /home_ubuntu.zip
    c56059bc8914b430226c310872201afe4fee051a008fed76afe2dd624a5e7013  /home_ubuntu.zip

    https://send.firefox.com/download/e152d58f4a175568/#T6fxKWupIGNR7bn26d4EqQ

  35. MarcoFalke commented at 1:29 PM on November 12, 2019: member

    Now I can't even build depends:

    commit: 25c136d30e linux-build.log

  36. dongcarl commented at 1:30 PM on November 13, 2019: member

    What's interesting here is that jonasschnelli's nightly builds are still working properly: https://bitcoin.jonasschnelli.ch/?show=nightly#nighly

  37. MarcoFalke commented at 1:34 PM on November 13, 2019: member

    I've installed a 19.10 Ubuntu Box with docker and the gitian builds work fine in there.

  38. dongcarl commented at 1:38 PM on November 13, 2019: member

    I've installed a 19.10 Ubuntu Box with docker and the gitian builds work fine in there.

    For the builds that failed... Did you git clean -xdff after each try?

  39. MarcoFalke commented at 1:54 PM on November 13, 2019: member

    There are two git directories: One where the gitian descriptors are drawn from, and the other where the code is drawn from. I ran git clean -dffx on both and got on the one with the gitian descriptors:

    $ git clean -dffx 
    Removing depends/work/
    

    No idea how this folder got there, nor how it could affect builds. Will retry now.

  40. MarcoFalke commented at 3:28 PM on February 22, 2020: member

    No longer seeing this

  41. MarcoFalke closed this on Feb 22, 2020

  42. MarcoFalke removed the label Needs gitian build on Mar 19, 2020
  43. DrahtBot locked this on Feb 15, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-17 06:14 UTC
