Consider pruning corpus to speed up the Bitcoin Core CI fuzzing job?

practicalswift commented at 10:20 am on August 19, 2020: contributor

Should we prune the corpus to get rid of large slow inputs that reach code that can be reached using smaller (and thus quicker-to-process) inputs?

The basic idea is to reduce the total size of the qa-assets inputs while keeping the same achieved coverage.

I think that might speed up the Bitcoin Core CI fuzzing job significantly.

Sounds good? Any opposition? :)

If we agree that it is a good idea I’m willing to do the job :)

darosior commented at 2:26 pm on August 20, 2020: member

Sounds good!

How would you proceed ? I would have expected merge=1 to choose the smaller input when choosing between 2 seeds with equal coverage, but maybe it’s not ?

practicalswift commented at 2:44 pm on August 20, 2020: contributor

@darosior -merge=1 will never delete any files, so it cannot replace an existing big seed file with a smaller seed file that achieves the same coverage. Thus the first input in the repo that achieved a certain coverage will be kept.

What needs to be done is to create a new seed corpus from scratch by -merge:ing in the existing set of seeds together with other (potentially smaller) seeds. The idea is that this process will give a more optimal set of seeds. It will also remove duplicates (two seeds achieving the same coverage) in our repo.

practicalswift commented at 7:53 am on August 28, 2020: contributor

Seeking input from fellow fuzzing enthusiasts: 👍 , 👎 or neutral? :)

I’m willing to do the work! :)

maflcko commented at 8:14 am on August 28, 2020: contributor

How are you going to ensure that coverage is measured across different architectures/operating systems when merging the inputs?

sipa commented at 8:18 am on August 28, 2020: contributor

One reason not to (in general), is that the fuzzing corpus is built up through many versions the code went through, and thus may contain implicit knowledge about bugs that once existed but were fixed (not things that caused a crash; just inputs that were relevant in trigger code paths related to bugs) - which may help finding them later if accidentally reintroduced.

If we’d do this, I’d suggest only doing it for fuzzers whose corpus size has grown very large.

I’d also suggest doing the minimization (create empty dir, -merge=1 into it) using both binaries with and without sanitizers - they tend to have different inputs that matter to them. EDIT: as Marco mentions, multiple architectures would be useful too.

So what you’d do is starting from corpus dir C:

Create new empty dir N
For each binary/platform/…:
- ./fuzzer -merge=1 -use_value_profile=1 N C
replace C with N

maflcko commented at 10:03 am on October 26, 2020: contributor

Another issue to consider is how to deal with stateful logic and pruning seeds based on (line/branch) coverage. For example the script interpreter has a stack to keep state during script execution. However, coverage based fuzzers are to the best of my knowledge not aware of state and might thus prune away seeds that might be the only ones exercising a specific code path through the interpreter that is already (line/branch) covered by different seed snippets.

Edit: Discussion about inputs with useful features that get removed because they don’t increase coverage: https://reviews.llvm.org/D86577

Crypt-iQ commented at 1:34 am on November 23, 2020: contributor

I agree with @MarcoFalke and @sipa - there may be interesting code paths with sanitizers or state during script execution that are pruned (I don’t think any fuzzers track path coverage either!). Though maybe specifically for CI fuzzing jobs, this qa-assets/fuzz_seed_corpus directory can be slightly pruned and put in a different directory? And fuzzing enthusiasts can run all the data at home without comprising any coverage.

maflcko commented at 8:08 am on November 23, 2020: contributor

Though practically, everyone is using libFuzzer to generate inputs (or similar fuzz engines that can’t fully track path coverage), so I’d be surprised if path coverage decreases after cleaning the seed files.

Now that the seed files change format occasionally, and the ci starts to reach its timeout limit when iterating over all historic seeds, we should start looking into cleaning the folder.

Keeping the historic seeds in a separate folder for now would be an option, but I don’t think it matters in practice, because we all use pretty much the same fuzz engine with the same coverage statistics.

maflcko commented at 7:58 am on December 22, 2020: contributor

Another thought: Ideally the pruning should be reproducible, which is currently impossible because the fuzz targets don’t achieve reproducible coverage.

Crypt-iQ commented at 2:02 pm on January 8, 2021: contributor

I’ve noticed that with a larger in-memory corpus, libFuzzer will take up more rss, and in some cases grind to a near halt with ~1 exec/s (process_messages harness).

maflcko commented at 5:28 pm on January 26, 2021: contributor

We can no longer iterate over all seeds in less than 2 CPU hours, so we might have to do this soon. Otherwise, CI checks will need to be disabled or otherwise reduced.

practicalswift commented at 6:15 pm on January 26, 2021: contributor

We have accumulated quite a large number of extremely large inputs that takes ages to process due to sheer size. A lot of these inputs do not add any meaningful coverage AFAICT.

In order to get meaningful fuzzing run times I usually do …

0find qa-assets/fuzz_seed_corpus/ -type f -size +1000k -delete

… before fuzzing. Where 1000k can be 100k, 10k or even 4k (libFuzzer default) depending on how much fuzzing runtime I want to spend.

It is basically impossible to process qa-assets/ with say -fsanitize=integer without doing so.

Regardless of the pruning I suggested in this issue I think we should at least kill off the “large-for-no-reason” seeds :)

WDYT? :)

laanwj closed this on Jan 27, 2021

maflcko commented at 6:52 am on January 28, 2021: contributor

Would be nice if someone could test #44

Consider pruning corpus to speed up the Bitcoin Core CI fuzzing job? #25