test: Add random interval upper bound in regtest

hebasto commented at 7:24 am on July 25, 2020: member

Actually, the exponential distribution is unbounded that causes intermittent timeout errors in functional tests.

Fix #18832 Fix #18969 Fix #19265 Fix #19311

in src/init.cpp:570 in 237bc3edbb outdated

566@@ -567,6 +567,7 @@ void SetupServerArgs(NodeContext& node)
567     gArgs.AddArg("-rpcwhitelistdefault", "Sets default behavior for rpc whitelisting. Unless rpcwhitelistdefault is set to 0, if any -rpcwhitelist is set, the rpc server acts as if all rpc users are subject to empty-unless-otherwise-specified whitelists. If rpcwhitelistdefault is set to 1 and no -rpcwhitelist is set, rpc server acts as if all rpc users are subject to empty whitelists.", ArgsManager::ALLOW_BOOL, OptionsCategory::RPC);
568     gArgs.AddArg("-rpcworkqueue=<n>", strprintf("Set the depth of the work queue to service RPC calls (default: %d)", DEFAULT_HTTP_WORKQUEUE), ArgsManager::ALLOW_ANY | ArgsManager::DEBUG_ONLY, OptionsCategory::RPC);
569     gArgs.AddArg("-server", "Accept command line and JSON-RPC commands", ArgsManager::ALLOW_ANY, OptionsCategory::RPC);
570+    gArgs.AddArg("-uniformintervaltest", "Use uniform distribution for random intervals instead of exponential one (default: 0)", ArgsManager::ALLOW_BOOL | ArgsManager::DEBUG_ONLY, OptionsCategory::DEBUG_TEST);

MarcoFalke commented at 7:40 am on July 25, 2020:

not sure if adding another mock-setting is the right approach here

This is not guarded and can accidentally be set on mainnet
This will run a different code path when set (partially code that has been added for this test-setting only), thus reducing the effectiveness of tests and increasing the maintenance burden
There is already a way to mock the poisson dist with the noban permission (This could be extracted to a hidden test-only permission if there are concerns about noban being too broad)
A timeout of 60s with the expected value of the poisson dist (5s or 7s or whatever) yields a failure rate that is far lower than the failure rate we see due to infrastructure problems or other test problems and races. The issue here is that the nodes are daisy chained and thus the delays add up, which could bring the delay over 60s more likely.

hebasto commented at 8:38 am on July 25, 2020:

This is not guarded and can accidentally be set on mainnet

Yes. It must be guarded. Going to fix it.

This will run a different code path when set (partially code that has been added for this test-setting only), thus reducing the effectiveness of tests and increasing the maintenance burden

The current code path, which is skipped in tests with this PR, is not covered by tests in master.

There is already a way to mock the poisson dist with the noban permission (This could be extracted to a hidden test-only permission if there are concerns about noban being too broad)

This approach does not work universally, e.g., in wallet_backup.py: https://github.com/bitcoin/bitcoin/blob/40a04814d130dfc9131af3f568eb44533e2bcbfc/test/functional/wallet_backup.py#L52-L57

A timeout of 60s with the expected value of the poisson dist (5s or 7s or whatever) yields a failure rate that is far lower than the failure rate we see due to infrastructure problems or other test problems and races. The issue here is that the nodes are daisy chained and thus the delays add up, which could bring the delay over 60s more likely.

The assumption about test failures that would be fixed by this PR could be biased as in the most cases we have only CI logs that lack data about infrastructure problems.

MarcoFalke commented at 7:55 pm on August 12, 2020:

The current code path, which is skipped in tests with this PR, is not covered by tests in master.

The same might be true for this change. Capping the time might never execute code paths (in other functions) that only run after a long timeout.

hebasto force-pushed on Jul 25, 2020

hebasto renamed this:
~~test: Use uniform distribution for random intervals in functional tests~~
test: Add random interval upper bound in regtest
on Jul 25, 2020

hebasto commented at 9:55 am on July 25, 2020: member

Reworked. @MarcoFalke Mind taking another look?

The changes are pretty local, and if they will not justify themselves, they could be easily reverted back :smiley:

DrahtBot commented at 6:33 pm on July 26, 2020: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#18077 (net: Add NAT-PMP port forwarding support by hebasto)
#17428 (p2p: Try to preserve outbound block-relay-only connections during restart by hebasto)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

DrahtBot added the label P2P on Jul 26, 2020

DrahtBot added the label Tests on Jul 26, 2020

hebasto force-pushed on Jul 29, 2020

hebasto commented at 11:41 am on July 29, 2020: member

Rebased c6d1cbdb4b7d21ebb3254781f8f6a6c8a679837d -> 8a945c0a4834b60c5453485a78cbae45ce65b3d9 (pr19588.02 -> pr19588.03) to include the recent CI changes.

test: Add random interval upper bound in regtest

These changes fix intermittent timeout errors in the functional tests.

b34b8edfb2

in src/net.cpp:2858 in 8a945c0a48 outdated

2854@@ -2853,7 +2855,16 @@ int64_t CConnman::PoissonNextSendInbound(int64_t now, int average_interval_secon
2855 
2856 int64_t PoissonNextSend(int64_t now, int average_interval_seconds)
2857 {
2858-    return now + (int64_t)(log1p(GetRand(1ULL << 48) * -0.0000000000000035527136788 /* -1/2^48 */) * average_interval_seconds * -1000000.0 + 0.5);
2859+    const int64_t average_interval_ms = average_interval_seconds * 1000000ULL;

instagibbs commented at 3:10 pm on August 5, 2020:

math isn’t my strongest suit: Why are we multiplying seconds by 100k to get ms? 🤔

hebasto commented at 3:14 pm on August 5, 2020:

math isn’t my strongest suit: Why are we multiplying seconds by 100k to get ms? thinking

Why 1000000ULL is 100k? :)

hebasto commented at 3:20 pm on August 5, 2020:

Going to update names.

hebasto commented at 3:26 pm on August 5, 2020:

Thanks! Updated.

instagibbs commented at 3:32 pm on August 5, 2020:

too many zeroes clearly :eyes: , thanks for update of name

hebasto force-pushed on Aug 5, 2020

hebasto commented at 3:29 pm on August 5, 2020: member

Updated 8a945c0a4834b60c5453485a78cbae45ce65b3d9 -> b34b8edfb24a7660376466bb136affcc6d688b75 (pr19588.03 -> pr19588.04, diff).

Addressed @instagibbs’ comment:

math isn’t my strongest suit: Why are we multiplying seconds by 100k to get ms?

in src/net.cpp:2862 in b34b8edfb2

2854@@ -2853,7 +2855,16 @@ int64_t CConnman::PoissonNextSendInbound(int64_t now, int average_interval_secon
2855 
2856 int64_t PoissonNextSend(int64_t now, int average_interval_seconds)
2857 {
2858-    return now + (int64_t)(log1p(GetRand(1ULL << 48) * -0.0000000000000035527136788 /* -1/2^48 */) * average_interval_seconds * -1000000.0 + 0.5);
2859+    const int64_t average_interval_microseconds = average_interval_seconds * 1000000ULL;
2860+    int64_t delta_microseconds = -std::log1p(GetRand(1ULL << 48) * -0.0000000000000035527136788 /* -1/2^48 */) * average_interval_microseconds + 0.5;
2861+
2862+    // Upper bound to avoid timeouts in the functional tests.
2863+    static const bool regtest = gArgs.GetChainName() == CBaseChainParams::REGTEST;

laanwj commented at 6:02 pm on August 12, 2020:

I think this is really ugly, both doing a string compare here, making the function depend on a global, and doing this at first call (which is not guaranteed to be after initializing the chain name). Though I don’t really have a proposal how to do this differently.

hebasto closed this on Aug 15, 2020

DrahtBot locked this on Feb 15, 2022

test: Add random interval upper bound in regtest #19588

Conflicts