Unit test failures when using multiple jobs and RANDOM_CTX_SEED #30696

issue hodlinator openend this issue on August 22, 2024
  1. hodlinator commented at 2:41 pm on August 22, 2024: contributor

    Is there an existing issue for this?

    • I have searched the existing issues

    Current behaviour

    While testing #30571 using:

    0RANDOM_CTX_SEED=1 make -j16 check
    

    …I discovered that 97e16f57042cab07e5e73f6bed19feec2006e4f7 from #29625 leads to test failures within the first 10 seconds in blockfilter_index_tests/blockfilter_index_initial_sync, argsman_tests/util_datadir or addrman_tests (probably more).

    Running without the environment variable, or single-threaded, passes:

    0make -j16 check
    1RANDOM_CTX_SEED=1 make -j1 check
    

    Expected behaviour

    Should always work, as it did before 97e16f57042cab07e5e73f6bed19feec2006e4f7.

    Steps to reproduce

    Console command:

    0RANDOM_CTX_SEED=1 make -j8 check
    

    Relevant log output

    Excerpts:

    02024-08-22T14:07:48.334146Z (mocktime: 2020-08-31T15:34:12Z) [basic block filter index] [util/thread.cpp:20] [TraceThread] basic block filter index thread start
    12024-08-22T14:07:48.334255Z (mocktime: 2020-08-31T15:34:12Z) [basic block filter index] [flatfile.cpp:69] [Allocate] [validation] Pre-allocating up to position 0x100000 in fltr00000.dat
    22024-08-22T14:07:48.334479Z (mocktime: 2020-08-31T15:34:12Z) [basic block filter index] [index/base.cpp:206] [Sync] Syncing basic block filter index with block chain from height 0
    32024-08-22T14:07:48.345502Z (mocktime: 2020-08-31T15:34:12Z) [basic block filter index] [index/base.cpp:220] [Sync] basic block filter index is enabled at height 100
    42024-08-22T14:07:48.345523Z (mocktime: 2020-08-31T15:34:12Z) [basic block filter index] [util/thread.cpp:22] [TraceThread] basic block filter index thread exit
    52024-08-22T14:07:48.434324Z (mocktime: 2020-08-31T15:34:12Z) [test] [flatfile.cpp:44] [Open] Unable to open file /run/user/1001/test_common_Bitcoin Core/cd149241dcbf7d8e2e5ebaf5a96aafdab9953a29260dce4e20e44f11a4b6efa5/regtest/indexes/blockfilter/basic/fltr00000.dat
    6test/blockfilter_index_tests.cpp(46): error: in "blockfilter_index_tests/blockfilter_index_initial_sync": check filter_index.LookupFilter(block_index, filter) has failed
    72024-08-22T14:07:48.434406Z (mocktime: 2020-08-31T15:34:12Z) [test] [flatfile.cpp:44] [Open] Unable to open file /run/user/1001/test_common_Bitcoin Core/cd149241dcbf7d8e2e5ebaf5a96aafdab9953a29260dce4e20e44f11a4b6efa5/regtest/indexes/blockfilter/basic/fltr00000.dat
    8test/blockfilter_index_tests.cpp(48): error: in "blockfilter_index_tests/blockfilter_index_initial_sync": check filter_index.LookupFilterRange(block_index->nHeight, block_index, filters) has failed
    9test/blockfilter_index_tests.cpp(55): error: in "blockfilter_index_tests/blockfilter_index_initial_sync": check filter.GetHash() == expected_filter.GetHash() has failed [9a538906e6466ebd2617d321f71bc94e56056ce213d366773699e28158e00614 != b17ba25562f198a43d1fc5e4b7114ac5346c5dcc85b95ed19278bbfa30de301f]
    
     0test/argsman_tests.cpp(25): Entering test suite "argsman_tests"
     1test/argsman_tests.cpp(27): Entering test case "util_datadir"
     22024-08-22T14:21:36Z SeedRandomForTest: Setting random seed for current tests to RANDOM_CTX_SEED=0000000000000000000000000000000000000000000000000000000000000001
     32024-08-22T14:21:36.227485Z [test] [init/common.cpp:149] [LogPackageVersion] Bitcoin Core version v27.99.0-97e16f57042c (release build)
     42024-08-22T14:21:36.227525Z [test] [node/chainstatemanager_args.cpp:57] [ApplyArgsManOptions] Script verification uses 15 additional threads
     52024-08-22T14:21:36.227556Z [test] [kernel/context.cpp:18] [Context] Using the 'sse4(1way),sse41(4way),avx2(8way)' SHA256 implementation
     62024-08-22T14:21:36.227561Z [test] [random.cpp:114] [ReportHardwareRand] Using RdSeed as an additional entropy source
     72024-08-22T14:21:36.227563Z [test] [random.cpp:117] [ReportHardwareRand] Using RdRand as an additional entropy source
     82024-08-22T14:21:36.233950Z [test] [script/sigcache.cpp:104] [InitSignatureCache] Using 16 MiB out of 16 MiB requested for signature cache, able to store 524288 elements
     92024-08-22T14:21:36.239326Z [test] [validation.cpp:2108] [InitScriptExecutionCache] Using 16 MiB out of 16 MiB requested for script execution cache, able to store 524288 elements
    10test/argsman_tests.cpp(27): Leaving test case "util_datadir"; testing time: 12446us
    11test/argsman_tests.cpp(171): Entering test case "util_CheckValue"
    12test_bitcoin: test/util/setup_common.cpp:294: void ChainTestingSetup::LoadVerifyActivateChainstate(): Assertion `status == node::ChainstateLoadStatus::SUCCESS' failed.
    13make[3]: *** [Makefile:22560: test/argsman_tests.cpp.test] Error 1
    
    0Entering test module "Bitcoin Core Test Suite"
    1test/addrman_tests.cpp(62): Entering test suite "addrman_tests"
    2test/addrman_tests.cpp(64): Entering test case "addrman_simple"
    3terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
    4  what():  filesystem error: cannot remove all: No such file or directory [/run/user/1001/test_common_Bitcoin Core/cd149241dcbf7d8e2e5ebaf5a96aafdab9953a29260dce4e20e44f11a4b6efa5] [/run/user/1001/test_common_Bitcoin Core/cd149241dcbf7d8e2e5ebaf5a96aafdab9953a29260dce4e20e44f11a4b6efa5/regtest/blocks/rev00000.dat]
    5make[3]: *** [Makefile:22560: test/addrman_tests.cpp.test] Error 1
    

    How did you obtain Bitcoin Core

    Compiled from source

    What version of Bitcoin Core are you using?

    97e16f57042cab07e5e73f6bed19feec2006e4f7

    Operating system and version

    NixOS unstable (Linux)

    Machine specifications

    16 logical cores, intel SSD

  2. maflcko added the label Tests on Aug 22, 2024
  3. maflcko commented at 3:19 pm on August 22, 2024: member
    Nice fine. RANDOM_CTX_SEED is only really for debugging single unit tests after the fact that they failed, but it would be nice to fix this nonetheless.
  4. hodlinator commented at 1:19 pm on August 23, 2024: contributor
    Thinking about it some more, it makes sense that if we give all jobs the same random seed and 97e16f57042cab07e5e73f6bed19feec2006e4f7 makes the test folder generation deterministic, the jobs will trample over each other. If that is the case, it’s probably better bring back some non-determinism in the folder generation.
  5. maflcko commented at 1:38 pm on August 23, 2024: member

    Correct. The easiest fix would be to just use GetStrongRandBytes, however, I am not sure how slow that’d be in the worst case.

    Another possible fix would be to move the SeedRandomForTest after the fuzz target’s init function, but before ResetCoverageCounters. However that’d still break if the fuzz target spun up a TestingSetup.

    Another possible fix would be to instead use the hash of the current folder generator, as well as the hash of the pid. Which may obviously break once the kernel wraps the pids, but I am not sure how likely that is.

  6. hodlinator commented at 2:07 pm on August 23, 2024: contributor

    Top of setup_common.cpp is a bit damning:

    0/** Random context to get unique temp data dirs. Separate from g_insecure_rand_ctx, which can be seeded from a const env var */
    1static FastRandomContext g_insecure_rand_ctx_temp_path;
    

    Together with the comment added in #29625 in test/util/random.cpp (“GetRandHash is no longer truly random”):

    0    // Do this once, on the first call, regardless of seedtype, because once
    1    // MakeRandDeterministicDANGEROUS is called, the output of GetRandHash is
    2    // no longer truly random. It should be enough to get the seed once for the
    3    // process.
    

    Folder generation runs const auto rand_str{g_insecure_rand_ctx_temp_path.rand256().ToString()} which internally calls:

    0void FastRandomContext::fillrand(Span<std::byte> output) noexcept
    1{
    2    if (requires_seed) RandomSeed();
    3    rng.Keystream(output);
    4}
    

    ..which uses the now deterministic GetRandHash().

    0void FastRandomContext::RandomSeed() noexcept
    1{
    2    uint256 seed = GetRandHash();
    3    rng.SetKey(MakeByteSpan(seed));
    4    requires_seed = false;
    5}
    

    => :boom: @sipa Would you be okay with us undoing the move of SeedRandomForTest from your 97e16f57042cab07e5e73f6bed19feec2006e4f7 so it happens after the folder name generation like it used to, meaning that part would not be affected by a deterministic seed (potentially being the same for multiple jobs if RANDOM_CTX_SEED is set)?

  7. sipa commented at 2:10 pm on August 23, 2024: member
    @hodlinator I probably won’t have time to look into this the next two weeks, but I’m happy to defer to @maflcko on this.
  8. hodlinator referenced this in commit cc5782fca1 on Aug 28, 2024
  9. hodlinator referenced this in commit db8bd3ad1e on Aug 28, 2024
  10. hodlinator referenced this in commit eeeb56b74c on Aug 28, 2024
  11. hodlinator referenced this in commit 065f0f66d6 on Aug 29, 2024
  12. maflcko commented at 7:33 am on November 14, 2024: member

    This was fixed in ad9c2cceda9cd893c0f754e49f7fca6e417ee95f by using the test name as seed.

    At least I can no longer reproduce, even when pinning the time to a constant:

     0diff --git a/src/test/util/setup_common.cpp b/src/test/util/setup_common.cpp
     1index 46a6daa88c..185adf3457 100644
     2--- a/src/test/util/setup_common.cpp
     3+++ b/src/test/util/setup_common.cpp
     4@@ -138,7 +138,7 @@ BasicTestingSetup::BasicTestingSetup(const ChainType chainType, TestOpts opts)
     5 
     6     const std::string test_name{G_TEST_GET_FULL_NAME ? G_TEST_GET_FULL_NAME() : ""};
     7     if (!m_node.args->IsArgSet("-testdatadir")) {
     8-        const auto now{TicksSinceEpoch<std::chrono::nanoseconds>(SystemClock::now())};
     9+        const auto now{1337};
    10         m_path_root = fs::temp_directory_path() / TEST_DIR_PATH_ELEMENT / test_name / util::ToString(now);
    11         TryCreateDirectories(m_path_root);
    12     } else {
    
    0RANDOM_CTX_SEED=1 ctest --test-dir ./bld-cmake -j 16
    

    Obviously it may still fail (unlikely), if the multiple test runners are run in parallel, but that is true for the functional test runner as well, so shouldn’t be a problem.

  13. fanquake closed this on Nov 14, 2024


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-23 21:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me