Shrink descriptor_parse size, increase coverage #104

pull Sjors wants to merge 1 commits into bitcoin-core:main from Sjors:2022/02/descriptor_parse changing 3875 files +1270 −2583
  1. Sjors commented at 12:53 pm on February 8, 2023: member

    This shrinks the number of input files for descriptor_parse, and reduces storage from 234MB to 172MB.

    libFuzzer options: -use_value_profile=1 -prefer_small=1 -shuffle=0

    --with-sanitizers=address,fuzzer,undefined

    Bitcoin Core de1d1895346ed27f471386011d7ff64508f8b39c

    After running the fuzzer for a bit, I moved the original input files into the directory where I kept my additional inputs, and then -merge’d back into the qa-assets directory.

    Let me know if this methodology is sound*. If so, I’ll run it a bit longer to see if that reduces storage more.

    * = maybe not, because I only ran on my own architecture?

  2. Expand descriptor_parse coverage
    This shrinks the number of input files, and reduces storage from 234MB to 172MB.
    
    libFuzzer options:
    -use_value_profile=1 -prefer_small=1 -shuffle=0
    
    --with-sanitizers=address,fuzzer,undefined
    
    Bitcoin Core de1d1895346ed27f471386011d7ff64508f8b39c
    047a9e7530
  3. Sjors commented at 12:54 pm on February 8, 2023: member

    cc @dergoegge @darosior

    Some relevant discussion in #25.

  4. darosior commented at 12:58 pm on February 8, 2023: member

    Can you generate a coverage report to make sure this did not in fact diminish coverage?

    Maybe @MarcoFalke can give guidance too. I’m far from an expert about fuzzing.

  5. maflcko commented at 1:12 pm on February 8, 2023: contributor
    No strong opinion, but I’d prefer if deleting inputs was only done every 6 months (before a release) with a script to aid review and reproducibility. See https://github.com/bitcoin-core/qa-assets/pull/97
  6. Sjors commented at 1:57 pm on February 8, 2023: member

    Happy to hold this in draft until closer to the release. @sipa wrote in the other issue:

    I’d also suggest doing the minimization (create empty dir, -merge=1 into it) using both binaries with and without sanitizers - they tend to have different inputs that matter to them.

    I re-ran the merge command on a binary with only the fuzzer sanitizer. This resulted in additional inputs, which I’ve pushed as a separate commit. It caused the inputs to grow to 260M, which is bigger than what this PR started with.

    However, I had not run the fuzzer itself without sanitizers, so perhaps the merge could not find smaller inputs relevant for this configuration. The answer seems to be no, I ended up with 260M again.

    So this approach doesn’t scale well.

    Also, if then we also want to hold on to inputs that previously covered bugs, to catch a regression, the whole thing becomes a combinatorial explosion.

    Perhaps it’s best in practice if individual developers keep a larger corpus on their own machine(s). In particular, anytime this repo is pruned, they could hold on to the old inputs.

  7. Sjors closed this on Feb 8, 2023

  8. maflcko commented at 2:01 pm on February 8, 2023: contributor

    Also, if then we also want to hold on to inputs that previously covered bugs, to catch a regression, the whole thing becomes a combinatorial explosion.

    OSS-Fuzz does this already. Also, if a bug was found previously by a CPU, it will likely be found again, given enough CPU. So I don’t worry too much about this. And if the “regression” input was invalidated anyway, holding on to it adds no value.

  9. maflcko commented at 2:03 pm on February 8, 2023: contributor
    No need to close this. If there are inputs that add coverage, they can be added. Especially if they add line coverage or branch coverage.
  10. Sjors reopened this on Feb 8, 2023

  11. Sjors commented at 2:05 pm on February 8, 2023: member
    Should I drop the part that deletes existing inputs then?
  12. maflcko commented at 2:10 pm on February 8, 2023: contributor
  13. Sjors commented at 2:22 pm on February 8, 2023: member
    That maintainer script won’t work here though. I didn’t just merge existing inputs, I generated new ones. The only way to make that deterministic would be to first add all the new ones I generated, which would bloat the repo (my unmerged descriptor_parse corpus is 1 GB with less than an hour of fuzzing).
  14. dergoegge commented at 2:23 pm on February 8, 2023: member

    Also, if then we also want to hold on to inputs that previously covered bugs, to catch a regression, the whole thing becomes a combinatorial explosion.

    OSS-Fuzz does this already. Also, if a bug was found previously by a CPU, it will likely be found again, given enough CPU. So I don’t worry too much about this. And if the “regression” input was invalidated anyway, holding on to it adds no value.

    I think for targets that have a common input format (like a string that represents a descriptor for example), it could make sense to to make sure that initial seeds and “regression” inputs are not lost (which might happen when merging into an empty corpus). I guess this would mostly be relevant for inputs that can’t be found by CPU, although making sure they aren’t delted reduces the time to detect regressions, as they would immediately be caught in the CI.

  15. Sjors commented at 2:25 pm on February 8, 2023: member
    Come to think of it, for the purpose of the CI, we obviously only need to worry about the CI architecture and chosen set of sanitizers.
  16. maflcko commented at 2:25 pm on February 8, 2023: contributor

    The only way to make that deterministic

    yeah, I don’t think we need to make that deterministic. Just using libfuzzer to merge your 1GB folder into this repo should be good enough.

  17. maflcko commented at 2:28 pm on February 8, 2023: contributor

    make sure that initial seeds and “regression” inputs are not lost

    I have no idea how to achieve this reliably. We’d need a way to assert coverage on those to avoid invalidation. Pull requests or proposals welcome :)

  18. maflcko commented at 2:41 pm on February 13, 2023: contributor

    @dergoegge I guess there could be something like:

     0class CoverageAssertion{
     1bool m_coverage{
     2#ifdef ASSERT_COVERAGE
     3false
     4#else 
     5true // nothing to check
     6#endif 
     7};
     8~CoverageAssertion(){
     9assert(m_coverage);
    10}
    11void Set() {m_coverage=true;}
    12};
    13static CoverageAssertion g_coverage_assertion{};
    

    And in the code:

    0if (condition) g_coverage_assertion.Set(); // covered this condition at least once
    
  19. maflcko commented at 2:41 pm on February 13, 2023: contributor
    @Sjors Are you still working on this?
  20. Sjors commented at 12:41 pm on February 14, 2023: member
    Not at the moment, it’s too unclear what the end result should look like. I can keep it open until closer to the next release-branch off.
  21. maflcko commented at 1:06 pm on February 14, 2023: contributor

    it’s too unclear what the end result should look like.

    It should look like any other pull to this repo: +x-0

  22. Sjors commented at 10:25 am on February 15, 2023: member
    I’ll make a new PR to add the new inputs, because the whole point of this PR was to shrink the size, not to increase it. Alternatively I can wait until there’s a way to safely prune.
  23. maflcko commented at 10:27 am on February 15, 2023: contributor

    Alternatively I can wait until there’s a way to safely prune

    Pretty sure this isn’t going to happen at all or at least any time soon.

  24. Sjors commented at 10:29 am on February 15, 2023: member
    Ok, I’ll run it a bit longer and with some other parameters and make a new PR soon(tm).
  25. Sjors closed this on Feb 15, 2023


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/qa-assets. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-10-30 01:25 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me