brainstorm: Reducing the size of this repo #130

issue dergoegge opened this issue on May 24, 2023
  1. dergoegge commented at 9:25 am on May 24, 2023: member

    This repository is very large (~16GB atm) and I think there are a bunch of things we could do to improve that.

    • Prune the git history, .git is currently at 4GB. (we don’t really need the history / we could archive the history to a separate repo)
    • Compress corpora (~6GB gzip)
    • Avoid large inputs / have separate repo for those

    The biggest downside to the size currently is that we pull this repo in our CI jobs (oss-fuzz as well) which is a big overhead.

    Maybe we set up an automated mirror repo that has the compressed corpora and no git history?

  2. dergoegge commented at 9:25 am on May 24, 2023: member
    @MarcoFalke thoughts?
  3. dergoegge commented at 9:36 am on May 24, 2023: member
    I guess cloning with --depth 1 already works quite well
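
    For illustration, a shallow clone fetches only the latest snapshot and skips the multi-GB history:

      # fetch only the most recent commit, not the full history
      git clone --depth 1 https://github.com/bitcoin-core/qa-assets.git
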
  4. maflcko commented at 9:41 am on May 24, 2023: contributor

    Prune the git history, .git is currently at 4GB. (we don’t really need the history/we could archive to the history to a separate repo)

    There is a size limit of ~10 GB on GitHub for repos, so once we reach that point, I doubt we’ll be able to host the history in a repo. Someone could put up a tar_gz of the .git folder on their personal website maybe?

    Compress corpora (~6GB gzip)

    This will make everything worse. It will make it impossible for git to track single fuzz inputs and de-duplicate them in the git history. Adding a single fuzz input requires a full copy of all fuzz inputs of the same fuzz target. Also, it makes it harder to browse and use.

    Avoid large inputs / have separate repo for those

    Maybe, but this will also make it harder to browse, use and contribute.

    The biggest downside to the size currently is that we pull this repo in our CI jobs (oss-fuzz as well) which is a big overhead.

    Not sure if this is a problem or whether it can be fixed. As you say, --depth=1 is already used and CI machines generally have a fast connection.

  5. dergoegge closed this on May 24, 2023

  6. maflcko commented at 10:08 am on May 24, 2023: contributor

    There is a size limit of ~10 GB on GitHub for repos, so once we reach that point, I doubt we’ll be able to host the history in a repo. Someone could put up a tar_gz of the .git folder on their personal website maybe?

    An alternative to squashing the history at that point may be to move the repo to GitLab, which has premium plans with 50 GB or 250 GB of storage.

  7. murchandamus commented at 5:31 pm on October 4, 2023: contributor
    Is git really the right tool to manage a data collection with so many files? Unfortunately, I don’t have a better idea (yet), but it does seem terribly slow in interactions with this repository.
  8. dergoegge commented at 5:37 pm on October 4, 2023: member
    I’ve recently been using the afl++ tooling more and noticed that afl-cmin (the afl++ corpus minimizer) produces much smaller corpora. This is explained by libFuzzer using more than coverage as feedback, which ends up bloating the corpora with inputs that achieve the same coverage but have otherwise interesting features (interesting according to libFuzzer). So we could consider using afl-cmin but I’m not sure how we would evaluate whether or not this is a good idea (besides the corpora size).
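
    As a rough sketch of the afl-cmin step (the target name and corpus paths are placeholders, and the exact harness invocation depends on how the fuzz binary was built):

      # keep only inputs that contribute unique edge coverage, per afl++
      FUZZ=process_message afl-cmin -i corpus_in/ -o corpus_min/ -- ./src/test/fuzz/fuzz
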
  9. murchandamus commented at 5:46 pm on October 4, 2023: contributor
    It might be worth fuzzing with either afl++ or libfuzzer but then only uploading what afl-cmin considers to be increasing coverage?
  10. maflcko commented at 8:29 am on October 5, 2023: contributor
    One could also remove the -use_value_profile=1 setting from the merge script?
  11. sipa commented at 12:13 pm on October 5, 2023: contributor

    I believe the current procedure around adding assets (which involves reducing w.r.t. the existing assets) results in wasted effort. However, the better alternative would result in far more churn, and as long as we’re tied to git for storage, we probably don’t want that.

    I believe that when one calls fuzz -merge=1 DIR1 DIR2 DIR3, assets in DIR2 and DIR3 which add coverage or features w.r.t. DIR1 get added to it. However, assets in DIR2 or DIR3 which are merely smaller, but don’t increase coverage beyond what DIR1 already has, do not. This means that reductions found by local fuzzing (“REDUCE” lines) don’t actually make it past the merging stage, unless they indirectly give rise to more coverage/features with a future mutation. To get reductions in, you’d need to use a new empty DIR1 rather than the existing qa-assets dir as DIR1, but that will likely cause merging to throw out existing entries often too.

    I believe an alternative procedure would be better, where people can submit new seeds without merging, and they just get added to the corpus. Once the corpus gets too big, or on a regular basis (e.g. before or after a release), the project compacts them using -merge into a new empty directory (sketched below). However, I suspect this would cause more churn in the git repo than we want, so perhaps we should think about whether we can come up with an alternative that isn’t git-based but still allows people to submit additions.
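
    To make the two merge modes concrete, a sketch (directory names are placeholders; fuzz stands for the libFuzzer-built binary of one target):

      # reducing merge into the existing corpus: inputs from new_inputs/ are only
      # copied into qa_assets_corpus/ if they add coverage/features on top of it,
      # so smaller-but-equivalent inputs ("REDUCE" results) are dropped
      fuzz -merge=1 qa_assets_corpus/ new_inputs/

      # compaction into a fresh directory: all inputs compete from scratch, so
      # reduced inputs can displace larger equivalents, at the cost of churn
      mkdir empty_corpus
      fuzz -merge=1 empty_corpus/ qa_assets_corpus/ new_inputs/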

  12. maflcko commented at 12:43 pm on October 5, 2023: contributor

    I believe the current procedure around adding assets (which involves reducing w.r.t. the existing assets) results in wasted effort.

    I think there are different procedures, which serve different purposes:

    • The qa-assets folder, which has the purpose to provide deterministic, reasonable coverage fuzz inputs to CI tasks
    • Continuous fuzzing, which may or may not use the qa-assets folder, and may or may not provide fuzz inputs back to the folder. The purpose here is to keep extending coverage, to find rare issues, and to protect against fuzz input format changes.

    This means that reductions found by local fuzzing (“REDUCE” lines) don’t actually make it past the merging stage, unless they indirectly give rise to more coverage/features with a future mutation.

    Good point. I guess having non-determinism in the fuzz targets and the -use_value_profile=1 bloat may cause small inputs to make it in regardless right now.

    I believe an alternative procedure would be better, where people can submit new seeds without merging, and they just get added to the corpus.

    Sgtm

    Once the corpus gets too big, or on a regular basis (e.g. before or after release), the project compacts them using -merge into a new empty directory.

    This is already done and will be done this month again.

  13. maflcko commented at 12:48 pm on October 5, 2023: contributor

    I believe an alternative procedure would be better, where people can submit new seeds without merging, and they just get added to the corpus.

    Sgtm

    I guess one way to implement this would be to have a “massive submit repo”, which is append-only and where each submitter may or may not first merge into an empty corpus.

    Then, there is a regular task to cherry-pick the “minimal” qa-assets folder used for CI.
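
    A rough sketch of how that regular task could look (the repo name and paths here are hypothetical):

      # append-only submissions accumulate in a separate, hypothetical repo
      git clone --depth 1 https://github.com/example/qa-assets-submissions.git
      # regular task: distill a minimal per-target corpus for CI use
      mkdir minimal_corpus
      fuzz -merge=1 minimal_corpus/ qa-assets-submissions/fuzz_seed_corpus/<target>/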

  14. dergoegge commented at 1:55 pm on October 5, 2023: member
    Just noting that the size of fuzz_seed_corpus is reduced from 14GB to 1.6GB using afl-cmin; coverage report: https://dergoegge.github.io/bitcoin-coverage/afl-cmin/fuzz.coverage/src/index.html
  15. maflcko commented at 1:56 pm on October 5, 2023: contributor
    How much would that be with libFuzzer with and without -use_value_profile?
  16. dergoegge commented at 2:55 pm on October 5, 2023: member

    How much would that be with libFuzzer with and without -use_value_profile?

    8.1GB with, 1.9GB without.

    So perhaps dropping use_value_profile from the merge script is the best low-effort solution for now?

  17. maflcko commented at 2:17 pm on October 13, 2023: contributor
    How much would be saved on top, if the -set_cover_merge=1 merge algorithm was used?
  18. sipa commented at 2:53 pm on October 13, 2023: contributor
    TIL -set_cover_merge=1. Not documented on https://llvm.org/docs/LibFuzzer.html?
  19. maflcko commented at 2:55 pm on October 13, 2023: contributor
    Yes, most options are not mentioned in the html help.
  20. sipa commented at 2:57 pm on October 13, 2023: contributor
    TIL that there is anything other than the html help. (-help=1 works…)
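
    For reference, the complete flag list, including the options the html docs omit, comes from the binary itself:

      # print every libFuzzer option, e.g. -set_cover_merge, with a short description
      ./fuzz -help=1
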
  21. sipa commented at 3:08 pm on October 13, 2023: contributor
    MERGE-INNER: 686831 total files; 0 processed earlier; will process 686831 files now
    ...
    #686831 DONE   cov: 4419 exec/s: 1064 rss: 182Mb
    MERGE-OUTER: successful in 1 attempt(s)
    MERGE-OUTER: the control file has 4141390298 bytes
    ==3695960== ERROR: libFuzzer: out-of-memory (used: 2079Mb; limit: 2048Mb)
    
  22. sipa reopened this on Oct 13, 2023

  23. sipa closed this on Oct 13, 2023

  24. maflcko commented at 3:18 pm on October 13, 2023: contributor

    How much would be saved on top, if the -set_cover_merge=1 merge algorithm was used?

    I am getting 1.6G vs 1.9G

  25. dergoegge commented at 3:25 pm on October 13, 2023: member

    I am getting 1.6G vs 1.9G

    Could you also test this with -use_value_profile=1 -set_cover_merge=1? Maybe that could be a nice middle ground…

  26. maflcko commented at 3:52 pm on October 13, 2023: contributor

    That’d be 6.0G, which still seems a bit large, given that we may want to re-think the merge process (always merge into a clean folder). I think we also want to keep the append-only aspect of pull requests to simplify review.

    As a next step, one could compare the runtime of the result of set_cover_merge vs merge.

  27. maflcko commented at 4:07 pm on October 13, 2023: contributor

    Running /usr/bin/time -f '%M KB, %S + %U' ./test/fuzz/test_runner.py on the result folder gives:

    • set_cover_merge_dir: 458140 KB max RSS, 39.68 s system + 1149.42 s user
    • merge_dir: 556452 KB max RSS, 64.96 s system + 1699.33 s user

    Which seems like a massive speed up for the same coverage?
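
    For anyone reproducing this, a sketch of the two runs (assuming test_runner.py takes the corpus directory as its positional argument):

      # %M = max resident set size in KB, %S = system seconds, %U = user seconds
      /usr/bin/time -f '%M KB, %S + %U' ./test/fuzz/test_runner.py set_cover_merge_dir
      /usr/bin/time -f '%M KB, %S + %U' ./test/fuzz/test_runner.py merge_dir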

  28. sipa commented at 4:33 pm on October 13, 2023: contributor

    Benchmark with a big miniscript_smart corpus (228210 files, 1.2G), though accidentally including all of fuzz_seed_corpus in the merge:

    • -use_value_profile=0 -merge=1: 4064 files, 21M

    • -use_value_profile=1 -merge=1: 4869 files, 30M (by extending the result from the previous line)

    • -use_value_profile=0 -set_cover_merge=1: 605 files, 3.1M

    • -use_value_profile=1 -set_cover_merge=1: 1037 files, 7.7M (by extending the result from the previous line)
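
    (For clarity, the “extending” runs reuse the previous output directory as DIR1, something like the following; directory names are placeholders:)

      # pass 1: plain coverage merge into a fresh directory
      fuzz -use_value_profile=0 -merge=1 out/ miniscript_corpus/
      # pass 2: rerun with value profiles; only inputs that add value-profile
      # features on top of out/ are appended to it
      fuzz -use_value_profile=1 -merge=1 out/ miniscript_corpus/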

  29. maflcko commented at 4:37 pm on October 13, 2023: contributor
    Interesting find. So I guess -use_value_profile=1 -set_cover_merge=1 is doable for some targets. Though, the massive targets addrman, banman, block, … probably dominate the overall result when it comes to storage and compute used.
  30. murchandamus commented at 5:05 pm on October 13, 2023: contributor
    For creating a new corpus at the branch-off point, would it perhaps make sense to at least combine the crème de la crème? I.e. if each of us merged their active fuzzing directory to a new directory with -set_cover_merge and pushed that branch to their own repos, someone could combine the old bloated set and our individual best sets to create the new starting point?
  31. maflcko commented at 8:07 am on October 14, 2023: contributor

    Could you also test this with -use_value_profile=1 -set_cover_merge=1? Maybe that could be a nice middle ground…

    I wonder if we did the wrong measurement, since we ran on a corpus generated with -merge=1. It may be better to re-run the measurements on a “dirty” corpus.

  32. murchandamus commented at 4:55 pm on October 16, 2023: contributor
    That would certainly be an interesting measurement as well
  33. maflcko commented at 4:15 pm on October 17, 2023: contributor

    Ok, I am getting that -set_cover_merge=1 can produce a smaller result if many “active/dirty” folders are used as input. So the result now likely has higher coverage, while being even smaller:

    • -set_cover_merge=1 -use_value_profile=0: 1.0G
    • -set_cover_merge=1 -use_value_profile=1: 5.3G

    Though, that still seems a bit large, given that the maximum repo size is apparently 10G on GitHub and 25G on GitLab.

    So I guess we can keep using -set_cover_merge=1 -use_value_profile=0 in the merge script here.
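
    Concretely, the compaction step in the merge script would then be along these lines (corpus paths are placeholders):

      # merge the "dirty" active-fuzzing folders into a fresh corpus, using the
      # set-cover algorithm with value-profile features disabled
      mkdir fresh_corpus
      fuzz -set_cover_merge=1 -use_value_profile=0 fresh_corpus/ dirty_corpus_a/ dirty_corpus_b/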

  34. murchandamus commented at 5:37 pm on October 17, 2023: contributor
    Oh, that’s an interesting thought. That might also explain why, for me, the set_cover_merge result from my active fuzzing directory was bigger than the merge result from active fuzzing + qa-assets/main.
