fuzz: Use immediate task runner to increase fuzz stability #31841

pull maflcko wants to merge 1 commits into bitcoin:master from maflcko:2502-fuzz-stable changing 1 files +6 −2

maflcko commented at 1:42 pm on February 11, 2025: member
Leaking a scheduler with a non-empty queue from the fuzz initialization phase into the fuzz target execution phase is problematic, because it messes with coverage data. This in turn is problematic, because it leads to:
- Decrease in fuzz target execution stability (non-determinism when running the fuzz target).
- Decrease in fuzz input merge stability (non-determinism when selecting a minimum set of fuzz input to reach maximum coverage), which leads to qa-assets bloat.
Fix one such issue. Tracking issue: #29018

Can be tested via: RUST_BACKTRACE=1 cargo run --manifest-path ./contrib/devtools/deterministic-fuzz-coverage/Cargo.toml -- $PWD/bld-cmake $PWD/../b-c-qa-assets/fuzz_corpora/ partially_downloaded_block.

The failure is non-deterministic (obviously) and will show coverage in validation signals such as UpdatedBlockTip before this change and will have this one fixed after this change.
DrahtBot commented at 1:42 pm on February 11, 2025: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31841.

Reviews

See the guideline for information on the review process.

Type Reviewers

ACK marcofleon, dergoegge

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
DrahtBot added the label Tests on Feb 11, 2025
maflcko force-pushed on Feb 11, 2025
maflcko commented at 4:47 pm on February 11, 2025: member

Actually, this is incomplete. It will handle the coverage generated in the notification itself (f()), but there is no guarantee that the re-lock is handled as well when the context ends happens before fuzz target execution starts after init. Relevant scope for the re-lock:

https://github.com/bitcoin/bitcoin/blob/86528937e5c4da2e12c46085fc41e87ed759258e/src/scheduler.cpp#L56-L61

The change here should be an improvement, but I wonder if there is a trivial one-line complete solution.

On alternative would be to inline the scheduler in tests (make it synchronous), but the patch would be larger.
maflcko renamed this:
~~test: Clear scheduler after init to increase fuzz stability~~
test: Use serial task runner to increase fuzz stability
on Feb 11, 2025
DrahtBot renamed this:
~~test: Use serial task runner to increase fuzz stability~~
test: Use serial task runner to increase fuzz stability
on Feb 11, 2025
maflcko force-pushed on Feb 11, 2025
maflcko force-pushed on Feb 11, 2025
maflcko force-pushed on Feb 11, 2025
maflcko force-pushed on Feb 11, 2025
DrahtBot added the label CI failed on Feb 11, 2025
DrahtBot commented at 7:27 pm on February 11, 2025: contributor
🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/37046883391

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:
- Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
- A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
- An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
maflcko renamed this:
~~test: Use serial task runner to increase fuzz stability~~
fuzz: Use serial task runner to increase fuzz stability
on Feb 11, 2025
maflcko commented at 7:47 pm on February 11, 2025: member

On alternative would be to inline the scheduler in tests (make it synchronous), but the patch would be larger.

Looks like this will lead to deadlocks due to lock order inversions between cs_main and cs_wallet in the unit tests. I’ll try to apply it to the fuzz tests only. If there are still issues in this pull, or in the future, they could be worked around for the affected fuzz target.
maflcko force-pushed on Feb 11, 2025
maflcko force-pushed on Feb 11, 2025
DrahtBot removed the label CI failed on Feb 11, 2025
maflcko commented at 12:44 pm on February 13, 2025: member
CI passed and coverage is as expected:
- This pull: https://drahtbot.space/host_reports/DrahtBot/reports/coverage_fuzz/monotree/faad87189015e559/733f70b21be22925/fuzz.coverage/index.html
- Before this pull: https://drahtbot.space/host_reports/DrahtBot/reports/coverage_fuzz/monotree/6a46be75c43e6489/733f70b21be22925/fuzz.coverage/index.html
maflcko force-pushed on Feb 19, 2025
dergoegge commented at 10:36 am on February 19, 2025: member

Concept ACK
maflcko commented at 10:38 am on February 19, 2025: member

(rebased without changes to make testing easier)
fuzz: Use serial task runner to increase fuzz stability fa4fb6a8f1
maflcko force-pushed on Mar 14, 2025
fanquake commented at 6:50 am on March 17, 2025: member

cc @marcofleon
marcofleon commented at 7:29 pm on March 17, 2025: contributor

ACK fa4fb6a8f15c295a2a3d3ffd737e115b8be46c1f

Seems fine to me to use ImmediateTaskRunner for fuzz tests. I checked stability for partially_downloaded_block and saw an increase from 90% to 94%.

Am I correct in saying that this change makes calling something like SyncWithValidationInterfaceQueue() in fuzz targets effectively a no-op?

edit: quick nit, I think the PR title means to say “use immediate task runner” yes?
DrahtBot requested review from dergoegge on Mar 17, 2025
maflcko renamed this:
~~fuzz: Use serial task runner to increase fuzz stability~~
fuzz: Use immediate task runner to increase fuzz stability
on Mar 17, 2025
maflcko commented at 7:35 pm on March 17, 2025: member

Am I correct in saying that this change makes calling something like SyncWithValidationInterfaceQueue() in fuzz targets effectively a no-op?

Yes, they could be removed now. Happy to include that here, but they should also be harmless.
marcofleon commented at 10:57 am on March 18, 2025: contributor

Happy to include that here

No need, just confirming for my own understanding.
dergoegge approved
dergoegge commented at 1:11 pm on March 20, 2025: member

Code review ACK fa4fb6a8f15c295a2a3d3ffd737e115b8be46c1f

The only downside to this is that we diverge from the production setup and e.g. bugs that only occur when fuzzing and using the SerialTaskRunner would be masked by this change. We could consider having a compile or runtime option to enable the serial runner and run both versions in CI.
maflcko commented at 2:13 pm on March 20, 2025: member

The only downside to this is that we diverge from the production setup and e.g. bugs that only occur when fuzzing and using the SerialTaskRunner would be masked by this change.

Right, but I wonder what type of bug this could possibly uncover, given that the fuzz targets are mostly single threaded and require some kind of sync (like SyncWithValidationInterfaceQueue) when they use the task runner.

Obviously, bugs in CScheduler itself could be missed. Though, if we are worried about that, a CScheduler fuzz target should be added. Unrelated to this pull, if we want to test multiple threads to detect lock issues and data races, I presume we’d have to add a specific fuzz target for those as well and add a tsan-fuzz CI task.
fanquake merged this on Mar 21, 2025
fanquake closed this on Mar 21, 2025
TheCharlatan commented at 5:36 am on March 21, 2025: contributor

Post-merge ACK
maflcko deleted the branch on Mar 21, 2025

Type	Reviewers
ACK	marcofleon, dergoegge

Contributors
maflcko DrahtBot dergoegge fanquake marcofleon TheCharlatan

Labels
Tests

fuzz: Use immediate task runner to increase fuzz stability #31841

Code Coverage & Benchmarks

Reviews