The first commits adds an ElapseSteady helper and type aliases. The second commit uses those helpers in ReportHeadersPresync and in the fuzz target to increase determinism.
Testing
It can be tested via (setting 32 parallel threads):
0cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml --$PWD/bld-cmake/$PWD/../b-c-qa-assets/fuzz_corpora/ p2p_headers_presync 32
The failing diff is contained in the commit messages, if applicable.
DrahtBot
commented at 8:15 am on April 2, 2025:
contributor
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
DrahtBot renamed this:
fuzz: Make p2p_headers_presync more deterministic
fuzz: Make p2p_headers_presync more deterministic
on Apr 2, 2025
DrahtBot added the label
Tests
on Apr 2, 2025
1440000bytes
commented at 4:46 pm on April 2, 2025:
none
Concept ACK
maflcko force-pushed
on Apr 2, 2025
maflcko force-pushed
on Apr 2, 2025
maflcko
commented at 6:41 am on April 3, 2025:
member
I’ve run cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml -- $PWD/bld-cmake/ $PWD/../qa-assets/fuzz_corpora/ p2p_headers_presync 32 150 times and it passed, so I guess this is reasonably deterministic now.
I’ve also force pushed to modify the commit messages with details.
marcofleon
commented at 3:10 pm on April 3, 2025:
contributor
150 times and it passed
When you say it passed, are you referring to running the inputs one by one, or including the full corpus run as well?
I still see differences in coverage when running the full corpus, but the previous instability that this PR addresses doesn’t seem to be showing up anymore, so that’s good.
maflcko
commented at 3:25 pm on April 3, 2025:
member
When you say it passed, are you referring to running the inputs one by one, or including the full corpus run as well?
Both. (All of it in the full command)
I still see differences in coverage when running the full corpus, but the previous instability that this PR addresses doesn’t seem to be showing up anymore, so that’s good.
Are you using libFuzzer?
marcofleon
commented at 3:30 pm on April 3, 2025:
contributor
Are you using libFuzzer?
yeah just figured this out, disregard. Do you know why this is? What about libFuzzer trips it up?
maflcko
commented at 3:32 pm on April 3, 2025:
member
Not sure yet. It could be libFuzzer and coverage instrumentation interacting in a bad way, but I have no idea. Maybe I can take a look later.
marcofleon
commented at 4:54 pm on April 3, 2025:
contributor
@dergoegge and I checked it out. Turns out libfuzzer shuffles the corpus, so running it twice with a different ordering of inputs can result in different coverage. On this PR, running with libFuzzer and setting -shuffle=0 should produce no coverage differences for p2p_headers_presync.
I think this is another sign of non-determinism, as the ordering shouldn’t affect line hit counts. It seems that using libFuzzer does end up providing a bit of new information.
edit: It probably shows instability caused by global state? Which was the intention of doing the full corpus run through in the first place. So shuffling it between runs might be necessary for catching this sort of non-determinism.
maflcko
commented at 5:18 pm on April 3, 2025:
member
So shuffling it between runs might be necessary for catching this sort of non-determinism.
Crypt-iQ
commented at 7:25 pm on April 3, 2025:
contributor
This may be a shot in the dark, but is it possible that ResetAndInitialize re-initializing the id counter to 0 introduces some non-determinism across runs depending on the ordering of the corpus? During normal bitcoind operation, the NodeId is always monotonically increasing; is there a reason it is reset in p2p_headers_presync and process_messages?
dergoegge
commented at 1:20 pm on April 4, 2025:
member
is there a reason it is reset in p2p_headers_presync and process_messages?
For each test case we want to simulate starting out with no peers (i.e. id = 0) and then add peers and send messages.
Crypt-iQ
commented at 2:37 pm on April 4, 2025:
contributor
For each test case we want to simulate starting out with no peers (i.e. id = 0) and then add peers and send messages.
That makes sense, I think my question was a bit unclear. There are two places I can see in net_processing where NodeId persists even after the peer has disconnected (i.e. polluting global state somewhat between fuzz iterations):
lNodesAnnouncingHeaderAndIDs
m_headers_presync_bestpeer
The first one does not have any fuzz coverage, but the second one is executed by p2p_headers_presync. Running the full command in the OP above gives the following result where m_headers_presync_bestpeerno longer exists in m_headers_presync_stats:
Changing the NodeId to be a member of HeadersSyncSetup that is monotonically increasing does get rid of the above non-determinism, but it seems to introduce some new non-determinism elsewhere.
Edit: The “new non-determinism” seems to be pre-existing, so I think changing NodeId to monotonically increase does reduce non-determinism if libfuzzer is used with shuffling corpus inputs.
marcofleon
commented at 4:12 pm on April 4, 2025:
contributor
Commit 2222dd8e60d82e5aa96f3d80a81ddca880d627ea added to this PR fixes all non-determinism for me. I should probably run the script more than 10 times to be sure, but so far so good.
@Crypt-iQ the part you’re highlighting is the only code where the connection to Niklas’s rationale isn’t immediately clear. I think it could be in MaybeSendPing where sometimes a peer gets marked for disconnection based on some timeout which is dependent on mock time. That node then gets removed from m_header_presync_stats in FinalizeNode. So having the two different mock times and a different input ordering would affect which peers are part of the presync.
Crypt-iQ
commented at 4:44 pm on April 4, 2025:
contributor
@marcofleon That sounds plausible, I don’t immediately understand the connection to mock time; I was thinking along the lines of the NodeId getting reused across iterations:
Iteration 1: id=0 is set as m_headers_presync_bestpeer
Iteration 2: ResetAndInitialize calls FinalizeNode which wipes all peer state except for m_headers_presync_bestpeer
Iteration 2: Depending on the corpus input, the peer with id=0 sends a HEADERS message which eventually calls IsContinuationOfLowWorkHeadersSync. This will populate the m_headers_presync_stats with id=0 which then means that the best peer will be found in m_headers_presync_stats.
marcofleon
commented at 11:45 am on April 7, 2025:
contributor
That’s a good explanation, makes sense to me. I spent some time trying to connect what you’re saying to the mock time fix, but turns out there is no connection. I forgot that taking out SetMockTime(ConsumeTime(fuzzed_data_provider)) changes the input format so running on the old corpus isn’t useful. As a workaround to generating a new corpus, I left the ConsumeTime in the test but didn’t set it to mock time and this specific instability still shows up. It’s separate from #32198 (comment) then.
Is m_header_presync_bestpeer only used for reporting/logging to validation? I’m not sure how useful to the test it would be to fix this specific source of non-determinism. Unless there’s a relatively simple fix that doesn’t slow down the test.
Commit 2222dd8e60d82e5aa96f3d80a81ddca880d627ea seems to address some of the other non-determinism that shows up so I think including it in this PR makes sense.
edit: I should add that #32198 (comment) should be disregarded, as it is incorrect.
maflcko
commented at 12:07 pm on April 7, 2025:
member
However, I am not seeing the global state that causes this. For reference, one part of the non-determinism is fixed by the change in commit 2222dd8. I am happy to re-add the commit here, but I’d feel better if I fully understood the commit myself.
I think it might be the m_rng on peerman which is a global in this test.
In the following m_rng is only used if tx_relay->m_next_inv_send_time < current_time and 2222dd8 causes that to no longer be the case:
Thanks! I missed that ResetAndInitialize calls NextInvToInbounds. Pushed the commit again.
So shuffling it between runs might be necessary for catching this sort of non-determinism.
Yeah, good catch and nice idea!
Update: Pushed a commit for this as well.
Iteration 1: id=0 is set as m_headers_presync_bestpeer
Nice explanation. Add a commit for this as well.
maflcko force-pushed
on Apr 7, 2025
maflcko
commented at 12:35 pm on April 7, 2025:
member
force pushed with one more fix. Also, reordered some commits.
Obviously, the non-deterministic std::ranges::shuffle(files, InsecureRandomContext{uint64_t(SystemClock::now().time_since_epoch().count())}); remains non-deterministic and should be ignored for now.
Let me know if this should be fixed. Possible fixes are:
Pre-generated (cached) random numbers that are generated before coverage measurements start
Moving the code to a translation unit that is excluded from coverage.
Crypt-iQ
commented at 1:38 pm on April 7, 2025:
contributor
Is m_header_presync_bestpeer only used for reporting/logging to validation? I’m not sure how useful to the test it would be to fix this specific source of non-determinism. Unless there’s a relatively simple fix that doesn’t slow down the test.
I forgot that taking out SetMockTime(ConsumeTime(fuzzed_data_provider)) changes the input format so running on the old corpus isn’t useful.
This brings up something I was wondering about – is there any current way to migrate the corpus if it gets invalidated? I was looking at FuzzedDataProvider and realized it takes from the front and back of the input which makes me a bit confused on how migration can happen.
Edit: Does fa2d3441c6557adba3d393abfc1af22fb788b5c2 invalidate the existing corpus or is the idea to either migrate or run the fuzzer long enough to achieve the same coverage?
sipa
commented at 1:46 pm on April 7, 2025:
member
Is m_header_presync_bestpeer only used for reporting/logging to validation?
It should be. Header presync is an inherently per-peer data operation (as the whole point is avoiding impact on global data structures until the peer has proven their chain is worthwhile), but for progress reporting we need a single number, so only the most advanced header sync is used.
maflcko
commented at 2:02 pm on April 7, 2025:
member
I forgot that taking out SetMockTime(ConsumeTime(fuzzed_data_provider)) changes the input format so running on the old corpus isn’t useful.
This brings up something I was wondering about – is there any current way to migrate the corpus if it gets invalidated? I was looking at FuzzedDataProvider and realized it takes from the front and back of the input which makes me a bit confused on how migration can happen.
Edit: Does fa2d344 invalidate the existing corpus or is the idea to either migrate or run the fuzzer long enough to achieve the same coverage?
In this case it should be trivial by just popping the last bytes (whatever number of bytes it takes for ConsumeTime) from the end. However, I don’t know how useful the prior inputs were, given that the fuzz target never was fully deterministic?
marcofleon
commented at 2:50 pm on April 8, 2025:
contributor
@janb84 I could be wrong but I think that might be coming from faaf3819f2c957a4c9688386b8301a1ef70ace07. If so, then #32198 (comment) mentioned it as a source of non-determinism to ignore for now.
janb84
commented at 4:21 pm on April 8, 2025:
contributor
@janb84 I could be wrong but I think that might be coming from faaf381. If so, then #32198 (comment) mentioned it as a source of non-determinism to ignore for now.
Ah my bad, seems logical. Thanks @marcofleon for pointing this out to me, somehow I didn’t made the connection between the two.
maflcko
commented at 9:57 am on April 9, 2025:
member
Instability in scheduler.cpp?
I don’t know where this is coming from and I can’t reproduce. Are you running with libFuzzer or without?
However, the scheduler thread isn’t needed at all, so I dropped it.
Let me know if this should be fixed.
I think this should be fixed to not give false positives
By using the standard library random module (which is not coverage-instrumented), it should be sufficient to avoid the non-determinism in the random module from the shuffle. Thus, replacing std::ranges::shuffle(files, InsecureRandomContext{uint64_t(SystemClock::now().time_since_epoch().count())}); by std::ranges::shuffle(files, std::mt19937{std::random_device{}()}); should fix it. However, I couldn’t confirm this by testing locally.
(I’m too junior PR reviewer to judge if the randomness issue has to be solved in this PR)
maflcko
commented at 3:45 pm on April 9, 2025:
member
(I’m too junior PR reviewer to judge if the randomness issue has to be solved in this PR)
I ran 400 times with the libFuzzer -shuffle=1 version and all of them passed, so at least there it seems somewhat deterministic.
The std::shuffle version seemed to shuffle more and ran into non-determinism. Probably the one you reported. I pushed one more commit to fix this as well. The commit message mentions that it did not reproduce on libFuzzer for me.
janb84
commented at 4:13 pm on April 9, 2025:
contributor
When replacing std::ranges::shuffle(files, InsecureRandomContext{uint64_t(SystemClock::now().time_since_epoch().count())}); with std::ranges::shuffle(files, std::mt19937{std::random_device{}()}); in this commit as suggested, the test is deterministic !
Have run the fuzz test 10x ✅
fuzz: Shuffle files before testing them
When iterating over all fuzz input files in a folder, the order should
not matter.
However, shuffling may be useful to detect non-determinism.
Thus, shuffle in fuzz.cpp, when using neither libFuzzer, nor AFL.
Also, shuffle in the deterministic-fuzz-coverage tool, when using
libFuzzer.
faf2e238fb
fuzz: Set ignore_incoming_txs in p2p_headers_presync
This avoids non-determistic code paths.
Without this patch, the tool would report a diff:
cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml -- $PWD/bld-cmake/ $PWD/../qa-assets/fuzz_corpora/ p2p_headers_presync 32
...
- 5371| 393| peer.m_next_send_feefilter = current_time + m_rng.randrange<std::chrono::microseconds>(MAX_FEEFILTER_CHANGE_DELAY);
- 5372| 393| }
+ 5371| 396| peer.m_next_send_feefilter = current_time + m_rng.randrange<std::chrono::microseconds>(MAX_FEEFILTER_CHANGE_DELAY);
+ 5372| 396| }
5373| 16.2k|}
...
fa98455e4b
fuzz: Move global node id counter along with other global state
The global m_headers_presync_stats is not reset in ResetAndInitialize.
This may lead to non-determinism.
Fix it by incrementing the global node id counter instead.
Without this patch, the tool would report a diff:
cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml -- $PWD/bld-cmake/ $PWD/../qa-assets/fuzz_corpora/ p2p_headers_presync 32
...
2587| 3.73k| if (best_it == m_headers_presync_stats.end()) {
------------------
- | Branch (2587:17): [True: 80, False: 3.65k]
+ | Branch (2587:17): [True: 73, False: 3.66k]
------------------
...
faf2d512c5
test: Introduce MockableSteadyClock::mock_time_point and ElapseSteady helper
This refactor clarifies that the MockableSteadyClock::mock_time_point
has millisecond precision by defining a type an using it.
Moreover, a ElapseSteady helper is added which can be re-used easily.
fa9c38794e
refactor: Use MockableSteadyClock in ReportHeadersPresync
This allows the clock to be mockable in tests. Also, replace cs_main
with GetMutex() while touching this function.
Also, use the ElapseSteady test helper in the p2p_headers_presync fuzz
target to make it more deterministic.
The m_last_presync_update variable is a global that is not reset in
ResetAndInitialize. However, it is only used for logging, so completely
disable it for now.
Without this patch, the tool would report a diff:
cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml -- $PWD/bld-cmake/ $PWD/../qa-assets/fuzz_corpora/ p2p_headers_presync 32
...
4468| 81| auto now = std::chrono::steady_clock::now();
4469| 81| if (now < m_last_presync_update + std::chrono::milliseconds{250}) return;
- ^80
+ ^79
...
fad22149f4
fuzz: Avoid setting the mock-time twice
It should be sufficient to set it once. Especially, if the dynamic value
is only used by ResetAndInitialize.
This also avoids non-determistic code paths, when ResetAndInitialize may
re-initialize m_next_inv_to_inbounds.
Without this patch, the tool would report a diff:
cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml -- $PWD/bld-cmake/ $PWD/../qa-assets/fuzz_corpora/ p2p_headers_presync 32
...
- 1126| 3| m_next_inv_to_inbounds = now + m_rng.rand_exp_duration(average_interval);
- 1127| 3| }
+ 1126| 10| m_next_inv_to_inbounds = now + m_rng.rand_exp_duration(average_interval);
+ 1127| 10| }
1128| 491| return m_next_inv_to_inbounds;
...
fafaca6cbc
fuzz: Disable unused validation interface and scheduler in p2p_headers_presync
This may also avoid non-determinism in the scheduler thread.
faf4c1b6fc
fuzz: Avoid influence on the global RNG from peerman m_rng
This should avoid the remaining non-determistic code coverage paths.
Without this patch, the tool would report a diff (only when running
without libFuzzer):
cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml -- $PWD/bld-cmake/ $PWD/../qa-assets/fuzz_corpora/ p2p_headers_presync 32
faa3ce3199
maflcko force-pushed
on Apr 9, 2025
maflcko
commented at 6:08 pm on April 9, 2025:
member
When replacing std::ranges::shuffle(files, InsecureRandomContext{uint64_t(SystemClock::now().time_since_epoch().count())}); with std::ranges::shuffle(files, std::mt19937{std::random_device{}()}); in this commit as suggested, the test is deterministic !
Ok, now I see that this is required for libc++, which I guess you are using? (I was using the GCC std lib)
Pushed that diff as well.
Hopefully this is the last one :sweat_smile:
janb84
commented at 6:27 pm on April 9, 2025:
contributor
Test is deterministic without changes ✅ (tested 10x)
When replacing std::ranges::shuffle(files, InsecureRandomContext{uint64_t(SystemClock::now().time_since_epoch().count())}); with std::ranges::shuffle(files, std::mt19937{std::random_device{}()}); in this commit as suggested, the test is deterministic !
Ok, now I see that this is required for libc++, which I guess you are using? (I was using the GCC std lib)
Yes the macos version of it & Clang 19
Pushed that diff as well.
Hopefully this is the last one 😅
I’m sorry, I was under the impression you wanted to do that change in a later PR.
maflcko requested review from marcofleon
on Apr 10, 2025
marcofleon
commented at 9:08 am on April 10, 2025:
contributor
Coverage on this PR is just from a run overnight but seems to be about the same where it counts.
I’ve run the script 10 times or so and the determinism is looking good!
Crypt-iQ
commented at 12:40 pm on April 10, 2025:
contributor
tACKfaf4c1b6fc330885ddf84b643838d1e301aaeab2
Have verified that the shuffle change in the first commit gets rid of the non-determinism. Just a question, does disabling the scheduler remove some important coverage that otherwise wouldn’t be hit?
maflcko
commented at 2:20 pm on April 10, 2025:
member
Just a question, does disabling the scheduler remove some important coverage that otherwise wouldn’t be hit?
I’d say no, see also #31841 (comment). I am thinking about disabling it globally for all fuzz targets and turn the opt-out into an opt-in. However, that’ll be a separate follow-up.
maflcko
commented at 2:21 pm on April 10, 2025:
member
This is a metadata mirror of the GitHub repository
bitcoin/bitcoin.
This site is not affiliated with GitHub.
Content is generated from a GitHub metadata backup.
generated: 2025-04-16 15:12 UTC
This site is hosted by @0xB10C More mirrored repositories can be found on mirror.b10c.me