fuzz: Make p2p_headers_presync more deterministic

maflcko commented at 8:14 am on April 2, 2025: member

This should make the p2p_headers_presync fuzz target more deterministic.

Tracking issue: #29018.

The first commits adds an ElapseSteady helper and type aliases. The second commit uses those helpers in ReportHeadersPresync and in the fuzz target to increase determinism.

Testing

It can be tested via (setting 32 parallel threads):

0cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml -- $PWD/bld-cmake/ $PWD/../b-c-qa-assets/fuzz_corpora/ p2p_headers_presync 32

The failing diff is contained in the commit messages, if applicable.

DrahtBot commented at 8:15 am on April 2, 2025: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32198.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	janb84, marcofleon, Crypt-iQ
Concept ACK	1440000bytes

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

DrahtBot renamed this:
~~fuzz: Make p2p_headers_presync more deterministic~~
fuzz: Make p2p_headers_presync more deterministic
on Apr 2, 2025

DrahtBot added the label Tests on Apr 2, 2025

1440000bytes commented at 4:46 pm on April 2, 2025: none

Concept ACK

maflcko force-pushed on Apr 2, 2025

maflcko commented at 6:41 am on April 3, 2025: member

I’ve run cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml -- $PWD/bld-cmake/ $PWD/../qa-assets/fuzz_corpora/ p2p_headers_presync 32 150 times and it passed, so I guess this is reasonably deterministic now.

I’ve also force pushed to modify the commit messages with details.

Coverage reports:

https://drahtbot.space/host_reports/DrahtBot/reports/coverage_fuzz/monotree/fa27b91aaa75d48c/728b3feddc08afb3/fuzz.coverage/index.html

https://drahtbot.space/host_reports/DrahtBot/reports/coverage_fuzz/monotree/77dff373a667a22d/728b3feddc08afb3/fuzz.coverage/index.html

marcofleon commented at 3:10 pm on April 3, 2025: contributor

150 times and it passed

When you say it passed, are you referring to running the inputs one by one, or including the full corpus run as well?

I still see differences in coverage when running the full corpus, but the previous instability that this PR addresses doesn’t seem to be showing up anymore, so that’s good.

maflcko commented at 3:25 pm on April 3, 2025: member

When you say it passed, are you referring to running the inputs one by one, or including the full corpus run as well?

Both. (All of it in the full command)

I still see differences in coverage when running the full corpus, but the previous instability that this PR addresses doesn’t seem to be showing up anymore, so that’s good.

Are you using libFuzzer?

marcofleon commented at 3:30 pm on April 3, 2025: contributor

Are you using libFuzzer?

yeah just figured this out, disregard. Do you know why this is? What about libFuzzer trips it up?

maflcko commented at 3:32 pm on April 3, 2025: member

Not sure yet. It could be libFuzzer and coverage instrumentation interacting in a bad way, but I have no idea. Maybe I can take a look later.

marcofleon commented at 4:54 pm on April 3, 2025: contributor

@dergoegge and I checked it out. Turns out libfuzzer shuffles the corpus, so running it twice with a different ordering of inputs can result in different coverage. On this PR, running with libFuzzer and setting -shuffle=0 should produce no coverage differences for p2p_headers_presync.

I think this is another sign of non-determinism, as the ordering shouldn’t affect line hit counts. It seems that using libFuzzer does end up providing a bit of new information.

edit: It probably shows instability caused by global state? Which was the intention of doing the full corpus run through in the first place. So shuffling it between runs might be necessary for catching this sort of non-determinism.

maflcko commented at 5:18 pm on April 3, 2025: member

So shuffling it between runs might be necessary for catching this sort of non-determinism.

Yeah, good catch and nice idea!

However, I am not seeing the global state that causes this. For reference, one part of the non-determinism is fixed by the change in commit https://github.com/bitcoin/bitcoin/commit/2222dd8e60d82e5aa96f3d80a81ddca880d627ea. I am happy to re-add the commit here, but I’d feel better if I fully understood the commit myself.

Crypt-iQ commented at 7:25 pm on April 3, 2025: contributor

This may be a shot in the dark, but is it possible that ResetAndInitialize re-initializing the id counter to 0 introduces some non-determinism across runs depending on the ordering of the corpus? During normal bitcoind operation, the NodeId is always monotonically increasing; is there a reason it is reset in p2p_headers_presync and process_messages?

dergoegge commented at 1:20 pm on April 4, 2025: member

However, I am not seeing the global state that causes this. For reference, one part of the non-determinism is fixed by the change in commit https://github.com/bitcoin/bitcoin/commit/2222dd8e60d82e5aa96f3d80a81ddca880d627ea. I am happy to re-add the commit here, but I’d feel better if I fully understood the commit myself.

I think it might be the m_rng on peerman which is a global in this test.

In the following m_rng is only used if tx_relay->m_next_inv_send_time < current_time and https://github.com/bitcoin/bitcoin/commit/2222dd8e60d82e5aa96f3d80a81ddca880d627ea causes that to no longer be the case: https://github.com/bitcoin/bitcoin/blob/65dcbec75661d1652beb927f03b1feab4fab932e/src/net_processing.cpp#L5679-L5686

is there a reason it is reset in p2p_headers_presync and process_messages?

For each test case we want to simulate starting out with no peers (i.e. id = 0) and then add peers and send messages.

Crypt-iQ commented at 2:37 pm on April 4, 2025: contributor

For each test case we want to simulate starting out with no peers (i.e. id = 0) and then add peers and send messages.

That makes sense, I think my question was a bit unclear. There are two places I can see in net_processing where NodeId persists even after the peer has disconnected (i.e. polluting global state somewhat between fuzz iterations):

lNodesAnnouncingHeaderAndIDs
m_headers_presync_bestpeer

The first one does not have any fuzz coverage, but the second one is executed by p2p_headers_presync. Running the full command in the OP above gives the following result where m_headers_presync_bestpeer no longer exists in m_headers_presync_stats:

 0 diff --git a/Users/eugenesiegel/btc/bitcoin-review/build/fuzz_det_cov.show.t0.a.txt b/Users/eugenesiegel/btc/bitcoin-review/build/fuzz_det_cov.show.t0.b.txt
 1 index 0822ec1f73..74efd6c71d 100644
 2  --- a/Users/eugenesiegel/btc/bitcoin-review/build/fuzz_det_cov.show.t0.a.txt
 3  +++ b/Users/eugenesiegel/btc/bitcoin-review/build/fuzz_det_cov.show.t0.b.txt
 4  @@ -102156,38 +102156,38 @@
 5    2586|  3.79k|            bool best_updated = false;
 6    2587|  3.79k|            if (best_it == m_headers_presync_stats.end()) {
 7     ------------------
 8  -  |  Branch (2587:17): [True: 37, False: 3.76k]
 9  +  |  Branch (2587:17): [True: 39, False: 3.75k]
10     ------------------
11    2588|       |                // If the cached best peer is outdated, iterate over all remaining ones (including
12    2589|       |                // newly updated one) to find the best one.
13  - 2590|     37|                NodeId peer_best{-1};
14  - 2591|     37|                const HeadersPresyncStats* stat_best{nullptr};
15  - 2592|     37|                for (const auto& [peer, stat] : m_headers_presync_stats) {
16  + 2590|     39|                NodeId peer_best{-1};
17  + 2591|     39|                const HeadersPresyncStats* stat_best{nullptr};
18  + 2592|     39|                for (const auto& [peer, stat] : m_headers_presync_stats) {
19     ------------------
20  -  |  Branch (2592:47): [True: 37, False: 37]
21  +  |  Branch (2592:47): [True: 39, False: 39]
22     ------------------
23  - 2593|     37|                    if (!stat_best || stat > *stat_best) {
24  + 2593|     39|                    if (!stat_best || stat > *stat_best) {
25                                                       ^0
26     ------------------
27  -  |  Branch (2593:25): [True: 37, False: 0]
28  +  |  Branch (2593:25): [True: 39, False: 0]
29     |  Branch (2593:39): [True: 0, False: 0]
30     ------------------
31  - 2594|     37|                        peer_best = peer;
32  - 2595|     37|                        stat_best = &stat;
33  - 2596|     37|                    }
34  - 2597|     37|                }
35  - 2598|     37|                m_headers_presync_bestpeer = peer_best;
36  - 2599|     37|                best_updated = (peer_best == pfrom.GetId());
37  - 2600|  3.76k|            } else if (best_it->first == pfrom.GetId() || stats > best_it->second) {
38  + 2594|     39|                        peer_best = peer;
39  + 2595|     39|                        stat_best = &stat;
40  + 2596|     39|                    }
41  + 2597|     39|                }
42  + 2598|     39|                m_headers_presync_bestpeer = peer_best;
43  + 2599|     39|                best_updated = (peer_best == pfrom.GetId());
44  + 2600|  3.75k|            } else if (best_it->first == pfrom.GetId() || stats > best_it->second) {

Changing the NodeId to be a member of HeadersSyncSetup that is monotonically increasing does get rid of the above non-determinism, but it seems to introduce some new non-determinism elsewhere.

Edit: The “new non-determinism” seems to be pre-existing, so I think changing NodeId to monotonically increase does reduce non-determinism if libfuzzer is used with shuffling corpus inputs.

marcofleon commented at 4:12 pm on April 4, 2025: contributor

Commit 2222dd8e60d82e5aa96f3d80a81ddca880d627ea added to this PR fixes all non-determinism for me. I should probably run the script more than 10 times to be sure, but so far so good. @Crypt-iQ the part you’re highlighting is the only code where the connection to Niklas’s rationale isn’t immediately clear. I think it could be in MaybeSendPing where sometimes a peer gets marked for disconnection based on some timeout which is dependent on mock time. That node then gets removed from m_header_presync_stats in FinalizeNode. So having the two different mock times and a different input ordering would affect which peers are part of the presync.

Crypt-iQ commented at 4:44 pm on April 4, 2025: contributor

@marcofleon That sounds plausible, I don’t immediately understand the connection to mock time; I was thinking along the lines of the NodeId getting reused across iterations:

Iteration 1: id=0 is set as m_headers_presync_bestpeer
Iteration 2: ResetAndInitialize calls FinalizeNode which wipes all peer state except for m_headers_presync_bestpeer
Iteration 2: Depending on the corpus input, the peer with id=0 sends a HEADERS message which eventually calls IsContinuationOfLowWorkHeadersSync. This will populate the m_headers_presync_stats with id=0 which then means that the best peer will be found in m_headers_presync_stats.

marcofleon commented at 11:45 am on April 7, 2025: contributor

That’s a good explanation, makes sense to me. I spent some time trying to connect what you’re saying to the mock time fix, but turns out there is no connection. I forgot that taking out SetMockTime(ConsumeTime(fuzzed_data_provider)) changes the input format so running on the old corpus isn’t useful. As a workaround to generating a new corpus, I left the ConsumeTime in the test but didn’t set it to mock time and this specific instability still shows up. It’s separate from #32198 (comment) then.

Is m_header_presync_bestpeer only used for reporting/logging to validation? I’m not sure how useful to the test it would be to fix this specific source of non-determinism. Unless there’s a relatively simple fix that doesn’t slow down the test.

Commit 2222dd8e60d82e5aa96f3d80a81ddca880d627ea seems to address some of the other non-determinism that shows up so I think including it in this PR makes sense.

edit: I should add that #32198 (comment) should be disregarded, as it is incorrect.

maflcko commented at 12:07 pm on April 7, 2025: member

However, I am not seeing the global state that causes this. For reference, one part of the non-determinism is fixed by the change in commit 2222dd8. I am happy to re-add the commit here, but I’d feel better if I fully understood the commit myself.

I think it might be the m_rng on peerman which is a global in this test.

In the following m_rng is only used if tx_relay->m_next_inv_send_time < current_time and 2222dd8 causes that to no longer be the case:

Thanks! I missed that ResetAndInitialize calls NextInvToInbounds. Pushed the commit again.

So shuffling it between runs might be necessary for catching this sort of non-determinism.

Yeah, good catch and nice idea!

Update: Pushed a commit for this as well.

Iteration 1: id=0 is set as m_headers_presync_bestpeer

Nice explanation. Add a commit for this as well.

maflcko force-pushed on Apr 7, 2025

maflcko commented at 12:35 pm on April 7, 2025: member

force pushed with one more fix. Also, reordered some commits.

Obviously, the non-deterministic std::ranges::shuffle(files, InsecureRandomContext{uint64_t(SystemClock::now().time_since_epoch().count())}); remains non-deterministic and should be ignored for now.

Let me know if this should be fixed. Possible fixes are:

Pre-generated (cached) random numbers that are generated before coverage measurements start
Moving the code to a translation unit that is excluded from coverage.

Crypt-iQ commented at 1:38 pm on April 7, 2025: contributor

Is m_header_presync_bestpeer only used for reporting/logging to validation? I’m not sure how useful to the test it would be to fix this specific source of non-determinism. Unless there’s a relatively simple fix that doesn’t slow down the test.

I haven’t fully grokked the presync headers logic, so I don’t know but I can find out. This commit fixes it https://github.com/bitcoin/bitcoin/pull/32198/commits/fa2c7cbd503cefba8b7db51be324aa0a3a992f2.

I forgot that taking out SetMockTime(ConsumeTime(fuzzed_data_provider)) changes the input format so running on the old corpus isn’t useful.

This brings up something I was wondering about – is there any current way to migrate the corpus if it gets invalidated? I was looking at FuzzedDataProvider and realized it takes from the front and back of the input which makes me a bit confused on how migration can happen.

Edit: Does fa2d3441c6557adba3d393abfc1af22fb788b5c2 invalidate the existing corpus or is the idea to either migrate or run the fuzzer long enough to achieve the same coverage?

sipa commented at 1:46 pm on April 7, 2025: member

Is m_header_presync_bestpeer only used for reporting/logging to validation?

It should be. Header presync is an inherently per-peer data operation (as the whole point is avoiding impact on global data structures until the peer has proven their chain is worthwhile), but for progress reporting we need a single number, so only the most advanced header sync is used.

maflcko commented at 2:02 pm on April 7, 2025: member

I forgot that taking out SetMockTime(ConsumeTime(fuzzed_data_provider)) changes the input format so running on the old corpus isn’t useful.

This brings up something I was wondering about – is there any current way to migrate the corpus if it gets invalidated? I was looking at FuzzedDataProvider and realized it takes from the front and back of the input which makes me a bit confused on how migration can happen.

Edit: Does fa2d344 invalidate the existing corpus or is the idea to either migrate or run the fuzzer long enough to achieve the same coverage?

In this case it should be trivial by just popping the last bytes (whatever number of bytes it takes for ConsumeTime) from the end. However, I don’t know how useful the prior inputs were, given that the fuzz target never was fully deterministic?

.. run a modified version of the fuzzed data provider: https://github.com/maflcko/DrahtBot/blob/main/inverse_fdp/FuzzedDataProvider.example.patch (with additional adjustments to account for the newly added or newly removed data calls), then copy-paste the resulting code along with the inverse fuzzed data provider (https://github.com/maflcko/DrahtBot/blob/main/inverse_fdp/src/main.rs#L33C12-L33C16) into a new rust executable and run it. Though, I’ve never tried this myself :sweat_smile:

Crypt-iQ commented at 10:34 am on April 8, 2025: contributor

tACK fa2d3441c6557adba3d393abfc1af22fb788b5c2

Ran the script 100 times and it didn’t give any errors.

Let me know if this should be fixed.

I think this should be fixed to not give false positives, but I think this can be done in another PR?

marcofleon commented at 10:58 am on April 8, 2025: contributor

Using a new corpus, this is coming up at times:

  0index 3cb0b6f082..9e79f23f38 100644
  1--- a/./fuzzbuild/fuzz_det_cov.show.t9.a.txt
  2+++ b/./fuzzbuild/fuzz_det_cov.show.t9.b.txt
  3@@ -168783,28 +168783,26 @@
  4    28|       |    // newTaskMutex is locked throughout this loop EXCEPT
  5    29|       |    // when the thread is waiting or when the user's function
  6    30|       |    // is called.
  7-   31|      2|    while (!shouldStop()) {
  8+   31|      1|    while (!shouldStop()) {
  9   ------------------
 10-  |  Branch (31:12): [True: 1, False: 1]
 11+  |  Branch (31:12): [True: 0, False: 1]
 12   ------------------
 13-   32|      1|        try {
 14-   33|      2|            while (!shouldStop() && taskQueue.empty()) {
 15-                                                  ^1
 16+   32|      0|        try {
 17+   33|      0|            while (!shouldStop() && taskQueue.empty()) {
 18   ------------------
 19-  |  Branch (33:20): [True: 1, False: 1]
 20-  |  Branch (33:37): [True: 1, False: 0]
 21+  |  Branch (33:20): [True: 0, False: 0]
 22+  |  Branch (33:37): [True: 0, False: 0]
 23   ------------------
 24    34|       |                // Wait until there is something to do.
 25-   35|      1|                newTaskScheduled.wait(lock);
 26-   36|      1|            }
 27+   35|      0|                newTaskScheduled.wait(lock);
 28+   36|      0|            }
 29    37|       |
 30    38|       |            // Wait until either there is a new task, or until
 31    39|       |            // the time of the first item on the queue:
 32    40|       |
 33-   41|      1|            while (!shouldStop() && !taskQueue.empty()) {
 34-                                                  ^0
 35+   41|      0|            while (!shouldStop() && !taskQueue.empty()) {
 36   ------------------
 37-  |  Branch (41:20): [True: 0, False: 1]
 38+  |  Branch (41:20): [True: 0, False: 0]
 39   |  Branch (41:37): [True: 0, False: 0]
 40   ------------------
 41    42|      0|                std::chrono::steady_clock::time_point timeToWaitFor = taskQueue.begin()->first;
 42@@ -168818,40 +168816,39 @@
 43    47|       |
 44    48|       |            // If there are multiple threads, the queue can empty while we're waiting (another
 45    49|       |            // thread may service the task we were waiting on).
 46-   50|      1|            if (shouldStop() || taskQueue.empty())
 47-                                              ^0
 48+   50|      0|            if (shouldStop() || taskQueue.empty())
 49   ------------------
 50-  |  Branch (50:17): [True: 1, False: 0]
 51+  |  Branch (50:17): [True: 0, False: 0]
 52   |  Branch (50:33): [True: 0, False: 0]
 53   ------------------
 54    51|      1|                continue;
 55    52|       |
 56-   53|      0|            Function f = taskQueue.begin()->second;
 57-   54|      0|            taskQueue.erase(taskQueue.begin());
 58+   53|  18.4E|            Function f = taskQueue.begin()->second;
 59+   54|  18.4E|            taskQueue.erase(taskQueue.begin());
 60    55|       |
 61-   56|      0|            {
 62+   56|  18.4E|            {
 63    57|       |                // Unlock before calling f, so it can reschedule itself or another task
 64    58|       |                // without deadlocking:
 65-   59|      0|                REVERSE_LOCK(lock);
 66+   59|  18.4E|                REVERSE_LOCK(lock);
 67   ------------------
 68-  |  |  243|      0|#define REVERSE_LOCK(g) typename std::decay<decltype(g)>::type::reverse_lock UNIQUE_NAME(revlock)(g, #g, __FILE__, __LINE__)
 69+  |  |  243|  18.4E|#define REVERSE_LOCK(g) typename std::decay<decltype(g)>::type::reverse_lock UNIQUE_NAME(revlock)(g, #g, __FILE__, __LINE__)
 70   |  |  ------------------
 71-  |  |  |  |   11|      0|#define UNIQUE_NAME(name) PASTE2(name, __COUNTER__)
 72+  |  |  |  |   11|  18.4E|#define UNIQUE_NAME(name) PASTE2(name, __COUNTER__)
 73   |  |  |  |  ------------------
 74-  |  |  |  |  |  |    9|      0|#define PASTE2(x, y) PASTE(x, y)
 75+  |  |  |  |  |  |    9|  18.4E|#define PASTE2(x, y) PASTE(x, y)
 76   |  |  |  |  |  |  ------------------
 77-  |  |  |  |  |  |  |  |    8|      0|#define PASTE(x, y) x ## y
 78+  |  |  |  |  |  |  |  |    8|  18.4E|#define PASTE(x, y) x ## y
 79   |  |  |  |  |  |  ------------------
 80   |  |  |  |  ------------------
 81   |  |  ------------------
 82   ------------------
 83-   60|      0|                f();
 84-   61|      0|            }
 85-   62|      0|        } catch (...) {
 86+   60|  18.4E|                f();
 87+   61|  18.4E|            }
 88+   62|  18.4E|        } catch (...) {
 89    63|      0|            --nThreadsServicingQueue;
 90    64|      0|            throw;
 91    65|      0|        }
 92-   66|      1|    }
 93+   66|      0|    }
 94    67|      1|    --nThreadsServicingQueue;
 95    68|      1|    newTaskScheduled.notify_one();
 96    69|      1|}
 97@@ -169245,11 +169242,11 @@
 98   108|       |    int nThreadsServicingQueue GUARDED_BY(newTaskMutex){0};
 99   109|       |    bool stopRequested GUARDED_BY(newTaskMutex){false};
100   110|       |    bool stopWhenEmpty GUARDED_BY(newTaskMutex){false};
101-  111|      6|    bool shouldStop() const EXCLUSIVE_LOCKS_REQUIRED(newTaskMutex) { return stopRequested || (stopWhenEmpty && taskQueue.empty()); }
102-                                                                                                           ^2^2               ^0
103+  111|      4|    bool shouldStop() const EXCLUSIVE_LOCKS_REQUIRED(newTaskMutex) { return stopRequested || (stopWhenEmpty && taskQueue.empty()); }
104+                                                                                                           ^0^0               ^0
105   ------------------
106-  |  Branch (111:77): [True: 4, False: 2]
107-  |  Branch (111:95): [True: 0, False: 2]
108+  |  Branch (111:77): [True: 4, False: 0]
109+  |  Branch (111:95): [True: 0, False: 0]
110   |  Branch (111:112): [True: 0, False: 0]
111   ------------------
112   112|       |};
113@@ -227794,7 +227791,7 @@
114    75|  1.25k|inline void EnterCritical(const char* pszName, const char* pszFile, int nLine, MutexType* cs, bool fTry = false) {}
115   ------------------
116   | void EnterCritical<std::mutex>(char const*, char const*, int, std::mutex*, bool):
117-  |   75|    943|inline void EnterCritical(const char* pszName, const char* pszFile, int nLine, MutexType* cs, bool fTry = false) {}
118+  |   75|    942|inline void EnterCritical(const char* pszName, const char* pszFile, int nLine, MutexType* cs, bool fTry = false) {}
119   ------------------
120   | void EnterCritical<std::recursive_mutex>(char const*, char const*, int, std::recursive_mutex*, bool):
121   |   75|    309|inline void EnterCritical(const char* pszName, const char* pszFile, int nLine, MutexType* cs, bool fTry = false) {}
122@@ -228041,7 +228038,7 @@
123   197|  1.25k|    {
124   198|  1.25k|        if (Base::owns_lock())
125   ------------------
126-  |  Branch (198:13): [True: 943, False: 1]
127+  |  Branch (198:13): [True: 943, False: 0]
128   ------------------
129   |  Branch (198:13): [True: 309, False: 0]
130   ------------------
131@@ -228051,13 +228048,13 @@
132   200|  1.25k|    }
133   ------------------
134   | UniqueLock<AnnotatedMixin<std::mutex> >::~UniqueLock():
135-  |  197|    944|    {
136-  |  198|    944|        if (Base::owns_lock())
137+  |  197|    943|    {
138+  |  198|    943|        if (Base::owns_lock())
139   |  ------------------
140-  |  |  Branch (198:13): [True: 943, False: 1]
141+  |  |  Branch (198:13): [True: 943, False: 0]
142   |  ------------------
143   |  199|    943|            LeaveCritical();
144-  |  200|    944|    }
145+  |  200|    943|    }
146   ------------------
147   | UniqueLock<AnnotatedMixin<std::recursive_mutex> >::~UniqueLock():
148   |  197|    309|    {
149@@ -228108,20 +228105,15 @@
150   | Unexecuted instantiation: UniqueLock<AnnotatedMixin<std::mutex> >::reverse_lock::reverse_lock(UniqueLock<AnnotatedMixin<std::mutex> >&, char const*, char const*, int)
151   ------------------
152   223|       |
153-  224|      1|        ~reverse_lock() {
154-  225|      1|            templock.swap(lock);
155-  226|      1|            EnterCritical(lockname.c_str(), file.c_str(), line, lock.mutex());
156-  227|      1|            lock.lock();
157-  228|      1|        }
158+  224|      0|        ~reverse_lock() {
159+  225|      0|            templock.swap(lock);
160+  226|      0|            EnterCritical(lockname.c_str(), file.c_str(), line, lock.mutex());
161+  227|      0|            lock.lock();
162+  228|      0|        }
163   ------------------
164   | Unexecuted instantiation: UniqueLock<AnnotatedMixin<std::recursive_mutex> >::reverse_lock::~reverse_lock()
165   ------------------
166-  | UniqueLock<AnnotatedMixin<std::mutex> >::reverse_lock::~reverse_lock():
167-  |  224|      1|        ~reverse_lock() {
168-  |  225|      1|            templock.swap(lock);
169-  |  226|      1|            EnterCritical(lockname.c_str(), file.c_str(), line, lock.mutex());
170-  |  227|      1|            lock.lock();
171-  |  228|      1|        }
172+  | Unexecuted instantiation: UniqueLock<AnnotatedMixin<std::mutex> >::reverse_lock::~reverse_lock()
173   ------------------
174   229|       |
175   230|       |     private:
176@@ -228137,7 +228129,7 @@
177   240|       |     friend class reverse_lock;
178   241|       |};
179   242|       |
180-  243|      0|#define REVERSE_LOCK(g) typename std::decay<decltype(g)>::type::reverse_lock UNIQUE_NAME(revlock)(g, #g, __FILE__, __LINE__)
181+  243|  18.4E|#define REVERSE_LOCK(g) typename std::decay<decltype(g)>::type::reverse_lock UNIQUE_NAME(revlock)(g, #g, __FILE__, __LINE__)
182   244|       |
183   245|       |// When locking a Mutex, require negative capability to ensure the lock
184   246|       |// is not already held
185⚠️
186
187The coverage was not deterministic between runs.
188The fuzz target input was ./p2pheaderpresync_newcorp/p2p_headers_presync/30ad165903c86c0b30bd5940c0031c9f89bd264d.

Instability in scheduler.cpp?

janb84 commented at 11:42 am on April 8, 2025: contributor

New corpus all runs end up non-deterministic:

Used command to test:

0 cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml -- $PWD/build_fuzz/ $PWD/qa-assets/fuzz_corpora/ p2p_headers_presync 9

Outcome:

0The coverage was not deterministic between runs.

Let me know if you need additional info or I need to test in a different way. (tested 4 times all runs end up like this)

 0@@ -144603,7 +144603,7 @@
 1|  185|  12.5k|    RandomNumberGenerator auto& Impl() noexcept { return static_cast<T&>(*this); }
 2   ------------------
 3   | _ZN11RandomMixinI21InsecureRandomContextE4ImplEv:
 4-  |  185|    668|    RandomNumberGenerator auto& Impl() noexcept { return static_cast<T&>(*this); }
 5+  |  185|    660|    RandomNumberGenerator auto& Impl() noexcept { return static_cast<T&>(*this); }
 6   ------------------
 7   186|       |
 8   187|       |protected:
 9@@ -145092,12 +145092,12 @@
10   ------------------
11   | Unexecuted instantiation: _ZN11RandomMixinI21InsecureRandomContextE3maxEv
12   ------------------
13-  367|    668|    inline uint64_t operator()() noexcept { return Impl().rand64(); }
14+  367|    660|    inline uint64_t operator()() noexcept { return Impl().rand64(); }
15   ------------------
16   | Unexecuted instantiation: _ZN11RandomMixinI17FastRandomContextEclEv
17   ------------------
18   | _ZN11RandomMixinI21InsecureRandomContextEclEv:
19-  |  367|    668|    inline uint64_t operator()() noexcept { return Impl().rand64(); }
20+  |  367|    660|    inline uint64_t operator()() noexcept { return Impl().rand64(); }
21   ------------------
22   368|       |};
23   369|       |
24@@ -145175,14 +145175,14 @@
25   437|      0|    }
26   438|       |
27   439|       |    constexpr uint64_t rand64() noexcept
28-  440|    668|    {
29-  441|    668|        uint64_t s0 = m_s0, s1 = m_s1;
30-  442|    668|        const uint64_t result = std::rotl(s0 + s1, 17) + s0;
31-  443|    668|        s1 ^= s0;
32-  444|    668|        m_s0 = std::rotl(s0, 49) ^ s1 ^ (s1 << 21);
33-  445|    668|        m_s1 = std::rotl(s1, 28);
34-  446|    668|        return result;
35-  447|    668|    }
36+  440|    660|    {
37+  441|    660|        uint64_t s0 = m_s0, s1 = m_s1;
38+  442|    660|        const uint64_t result = std::rotl(s0 + s1, 17) + s0;
39+  443|    660|        s1 ^= s0;
40+  444|    660|        m_s0 = std::rotl(s0, 49) ^ s1 ^ (s1 << 21);
41+  445|    660|        m_s1 = std::rotl(s1, 28);
42+  446|    660|        return result;
43+  447|    660|    }
44   448|       |};
45   449|       |
46   450|       |
47⚠️
48
49The coverage was not deterministic between runs.

marcofleon commented at 2:50 pm on April 8, 2025: contributor

@janb84 I could be wrong but I think that might be coming from faaf3819f2c957a4c9688386b8301a1ef70ace07. If so, then #32198 (comment) mentioned it as a source of non-determinism to ignore for now.

janb84 commented at 4:21 pm on April 8, 2025: contributor

tACK fa2d344

@janb84 I could be wrong but I think that might be coming from faaf381. If so, then #32198 (comment) mentioned it as a source of non-determinism to ignore for now.

Ah my bad, seems logical. Thanks @marcofleon for pointing this out to me, somehow I didn’t made the connection between the two.

maflcko commented at 9:57 am on April 9, 2025: member

Instability in scheduler.cpp?

I don’t know where this is coming from and I can’t reproduce. Are you running with libFuzzer or without?

However, the scheduler thread isn’t needed at all, so I dropped it.

Let me know if this should be fixed.

I think this should be fixed to not give false positives

By using the standard library random module (which is not coverage-instrumented), it should be sufficient to avoid the non-determinism in the random module from the shuffle. Thus, replacing std::ranges::shuffle(files, InsecureRandomContext{uint64_t(SystemClock::now().time_since_epoch().count())}); by std::ranges::shuffle(files, std::mt19937{std::random_device{}()}); should fix it. However, I couldn’t confirm this by testing locally.

I am still seeing non-determinism. For example:

0   196|  60.7k|    if (next_height % HEADER_COMMITMENT_PERIOD == m_commit_offset) {
1   ------------------
2-  |  Branch (196:9): [True: 59, False: 60.6k]
3+  |  Branch (196:9): [True: 60, False: 60.6k]
4   ------------------
5   197|       |        // Add a commitment.
6-  198|     59|        m_header_commitments.push_back(m_hasher(current.GetHash()) & 1);
7-  199|     59|        if (m_header_commitments.size() > m_max_commitments) {
8+  198|     60|        m_header_commitments.push_back(m_hasher(current.GetHash()) & 1);
9+  199|     60|        if (m_header_commitments.size() > m_max_commitments) {

I am not sure how easy it is to avoid all non-determinism. So maybe those can be handled in a follow-up?

janb84 commented at 12:58 pm on April 9, 2025: contributor

Re-ACK faee556

Change since last ACK:

removed he scheduler functionality

(I’m too junior PR reviewer to judge if the randomness issue has to be solved in this PR)

maflcko commented at 3:45 pm on April 9, 2025: member

(I’m too junior PR reviewer to judge if the randomness issue has to be solved in this PR)

I ran 400 times with the libFuzzer -shuffle=1 version and all of them passed, so at least there it seems somewhat deterministic.

The std::shuffle version seemed to shuffle more and ran into non-determinism. Probably the one you reported. I pushed one more commit to fix this as well. The commit message mentions that it did not reproduce on libFuzzer for me.

janb84 commented at 4:13 pm on April 9, 2025: contributor

Re-ACK fa3cd7a

When replacing std::ranges::shuffle(files, InsecureRandomContext{uint64_t(SystemClock::now().time_since_epoch().count())}); with std::ranges::shuffle(files, std::mt19937{std::random_device{}()}); in this commit as suggested, the test is deterministic !

Have run the fuzz test 10x ✅

fuzz: Shuffle files before testing them

When iterating over all fuzz input files in a folder, the order should
not matter.

However, shuffling may be useful to detect non-determinism.

Thus, shuffle in fuzz.cpp, when using neither libFuzzer, nor AFL.

Also, shuffle in the deterministic-fuzz-coverage tool, when using
libFuzzer.

faf2e238fb

fuzz: Set ignore_incoming_txs in p2p_headers_presync

This avoids non-determistic code paths.

Without this patch, the tool would report a diff:

cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml -- $PWD/bld-cmake/ $PWD/../qa-assets/fuzz_corpora/ p2p_headers_presync 32

...
- 5371|    393|        peer.m_next_send_feefilter = current_time + m_rng.randrange<std::chrono::microseconds>(MAX_FEEFILTER_CHANGE_DELAY);
- 5372|    393|    }
+ 5371|    396|        peer.m_next_send_feefilter = current_time + m_rng.randrange<std::chrono::microseconds>(MAX_FEEFILTER_CHANGE_DELAY);
+ 5372|    396|    }
  5373|  16.2k|}
...

fa98455e4b

fuzz: Move global node id counter along with other global state

The global m_headers_presync_stats is not reset in ResetAndInitialize.
This may lead to non-determinism.

Fix it by incrementing the global node id counter instead.

Without this patch, the tool would report a diff:

cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml -- $PWD/bld-cmake/ $PWD/../qa-assets/fuzz_corpora/ p2p_headers_presync 32

...
  2587|  3.73k|            if (best_it == m_headers_presync_stats.end()) {
   ------------------
-  |  Branch (2587:17): [True: 80, False: 3.65k]
+  |  Branch (2587:17): [True: 73, False: 3.66k]
   ------------------
...

faf2d512c5

test: Introduce MockableSteadyClock::mock_time_point and ElapseSteady helper

This refactor clarifies that the MockableSteadyClock::mock_time_point
has millisecond precision by defining a type an using it.

Moreover, a ElapseSteady helper is added which can be re-used easily.

fa9c38794e

refactor: Use MockableSteadyClock in ReportHeadersPresync

This allows the clock to be mockable in tests. Also, replace cs_main
with GetMutex() while touching this function.

Also, use the ElapseSteady test helper in the p2p_headers_presync fuzz
target to make it more deterministic.

The m_last_presync_update variable is a global that is not reset in
ResetAndInitialize. However, it is only used for logging, so completely
disable it for now.

Without this patch, the tool would report a diff:

cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml -- $PWD/bld-cmake/ $PWD/../qa-assets/fuzz_corpora/ p2p_headers_presync 32

...
  4468|     81|        auto now = std::chrono::steady_clock::now();
  4469|     81|        if (now < m_last_presync_update + std::chrono::milliseconds{250}) return;
-                                                                                        ^80
+                                                                                        ^79
...

fad22149f4

fuzz: Avoid setting the mock-time twice

It should be sufficient to set it once. Especially, if the dynamic value
is only used by ResetAndInitialize.

This also avoids non-determistic code paths, when ResetAndInitialize may
re-initialize m_next_inv_to_inbounds.

Without this patch, the tool would report a diff:

cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml -- $PWD/bld-cmake/ $PWD/../qa-assets/fuzz_corpora/ p2p_headers_presync 32

...
- 1126|      3|        m_next_inv_to_inbounds = now + m_rng.rand_exp_duration(average_interval);
- 1127|      3|    }
+ 1126|     10|        m_next_inv_to_inbounds = now + m_rng.rand_exp_duration(average_interval);
+ 1127|     10|    }
  1128|    491|    return m_next_inv_to_inbounds;
...

fafaca6cbc

fuzz: Disable unused validation interface and scheduler in p2p_headers_presync

This may also avoid non-determinism in the scheduler thread.

faf4c1b6fc

fuzz: Avoid influence on the global RNG from peerman m_rng

This should avoid the remaining non-determistic code coverage paths.

Without this patch, the tool would report a diff (only when running
without libFuzzer):

cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml -- $PWD/bld-cmake/ $PWD/../qa-assets/fuzz_corpora/ p2p_headers_presync 32

faa3ce3199

maflcko force-pushed on Apr 9, 2025

maflcko commented at 6:08 pm on April 9, 2025: member

When replacing std::ranges::shuffle(files, InsecureRandomContext{uint64_t(SystemClock::now().time_since_epoch().count())}); with std::ranges::shuffle(files, std::mt19937{std::random_device{}()}); in this commit as suggested, the test is deterministic !

Ok, now I see that this is required for libc++, which I guess you are using? (I was using the GCC std lib)

Pushed that diff as well.

Hopefully this is the last one :sweat_smile:

janb84 commented at 6:27 pm on April 9, 2025: contributor

Re-ACK faa3ce3

Test is deterministic without changes ✅ (tested 10x)

When replacing std::ranges::shuffle(files, InsecureRandomContext{uint64_t(SystemClock::now().time_since_epoch().count())}); with std::ranges::shuffle(files, std::mt19937{std::random_device{}()}); in this commit as suggested, the test is deterministic !

Ok, now I see that this is required for libc++, which I guess you are using? (I was using the GCC std lib)

Yes the macos version of it & Clang 19

Pushed that diff as well.

Hopefully this is the last one 😅

I’m sorry, I was under the impression you wanted to do that change in a later PR.

maflcko requested review from marcofleon on Apr 10, 2025

marcofleon commented at 9:08 am on April 10, 2025: contributor

ACK faa3ce31998cdbdcb888132b0d38ee7a1e743682

coverage on master coverage on this PR

Coverage on this PR is just from a run overnight but seems to be about the same where it counts.

I’ve run the script 10 times or so and the determinism is looking good!

Crypt-iQ commented at 12:40 pm on April 10, 2025: contributor

tACK faf4c1b6fc330885ddf84b643838d1e301aaeab2

Have verified that the shuffle change in the first commit gets rid of the non-determinism. Just a question, does disabling the scheduler remove some important coverage that otherwise wouldn’t be hit?

maflcko commented at 2:20 pm on April 10, 2025: member

Just a question, does disabling the scheduler remove some important coverage that otherwise wouldn’t be hit?

I’d say no, see also #31841 (comment). I am thinking about disabling it globally for all fuzz targets and turn the opt-out into an opt-in. However, that’ll be a separate follow-up.

maflcko commented at 2:21 pm on April 10, 2025: member

tACK faf4c1b

fyi, looks like your review is stale (not the top commit hash)

Crypt-iQ commented at 2:45 pm on April 10, 2025: contributor

tACK faa3ce31998cdbdcb888132b0d38ee7a1e743682

fyi, looks like your review is stale (not the top commit hash)

(Sorry, copied the wrong commit hash when making the comment)

glozow merged this on Apr 10, 2025

glozow closed this on Apr 10, 2025

maflcko deleted the branch on Apr 10, 2025

fuzz: Make p2p_headers_presync more deterministic #32198

Testing

Code Coverage & Benchmarks

Reviews