Add a seed corpus generation option to the fuzzing test_runner #19659

pull darosior wants to merge 1 commits into bitcoin:master from darosior:20/08/04/fuzz_runner_gen_seeds changing 1 files +60 −12
  1. darosior commented at 4:46 PM on August 4, 2020: member

    This adds a startup option to test/fuzz/test_runner.py which allows to generate seed corpus to the passed seed_dir instead of using them.

  2. in test/fuzz/test_runner.py:111 in 5adbd1b2a8 outdated
     107 | @@ -100,6 +108,11 @@ def main():
     108 |  
     109 |      logging.info("{} of {} detected fuzz target(s) selected: {}".format(len(test_list_selection), len(test_list_all), " ".join(test_list_selection)))
     110 |  
     111 | +    if args.generate:
    


    MarcoFalke commented at 5:03 PM on August 4, 2020:

    Any reason to not reuse the fuzz_pool below? Benefit is also that it will check for libFuzzer and abort early if not found


    darosior commented at 5:09 PM on August 4, 2020:

    The main reason was to avoid the seedless check just below


    MarcoFalke commented at 5:11 PM on August 4, 2020:

    The check is just a debug print, no? If it does any harm, you could disable it when args.generate.

    Though, I like the libFuzzer check to run before any libFuzzer specific args are passed to a subprocess.


    darosior commented at 5:13 PM on August 4, 2020:

    Ok, will change

  3. in test/fuzz/test_runner.py:182 in 5adbd1b2a8 outdated
     177 | +        def job(command):
     178 | +            try:
     179 | +                logging.debug("Running {}..".format(" ".join(command)))
     180 | +                logging.debug("Output: {}\n\n".format(
     181 | +                    subprocess.run(command, check=True,
     182 | +                                   stderr=subprocess.PIPE, timeout=n*5).stderr
    


    MarcoFalke commented at 5:04 PM on August 4, 2020:

    Any reason to set a time based timeout here, but max runs by number below?


    darosior commented at 5:12 PM on August 4, 2020:

    Yeah, the max runs is for the number of targets per batch to run in parallel, while this timeout is because some targets will have a harder time finding new coverage before some others and is to avoid obstructing the queue with them.


    darosior commented at 5:15 PM on August 4, 2020:

    Hmm actually i used something different and timed out slow fuzzing processes because they were blocking the whole next batch.. With a thread pool maybe it's not necessary ?

  4. MarcoFalke commented at 5:05 PM on August 4, 2020: member

    Concept ACK

  5. DrahtBot added the label Tests on Aug 4, 2020
  6. darosior force-pushed on Aug 4, 2020
  7. darosior commented at 5:51 PM on August 4, 2020: member

    Ok, removed the early return in main to keep the libfuzzer check, and removed the timeout

  8. in test/fuzz/test_runner.py:176 in 749f6cb52c outdated
     171 | +    """Generates new corpus seeds.
     172 | +
     173 | +    Run {n} {targets} at a time without input, and outputs the generated corpus
     174 | +    seeds to {seed_dir}.
     175 | +    """
     176 | +    logging.info(f"Generating corpus seeds to {seed_dir}")
    


    Crypt-iQ commented at 5:26 AM on August 5, 2020:

    Extraneous f in f"Generating..."


    darosior commented at 9:12 AM on August 5, 2020:

    That's an fstring.


    darosior commented at 9:17 AM on August 5, 2020:

    Had to drop it anyway because we need to support python 3.5 !

  9. Crypt-iQ commented at 5:29 AM on August 5, 2020: contributor

    After locally fixing the log statement, I was able to run this and get a starting corpus pretty quickly. Why wouldn't you just use the populated corpus from qa-assets?

  10. in test/fuzz/test_runner.py:63 in 749f6cb52c outdated
      55 | @@ -56,6 +56,14 @@ def main():
      56 |          '--m_dir',
      57 |          help='Merge inputs from this directory into the seed_dir. Needs /target subdirectory.',
      58 |      )
      59 | +    parser.add_argument(
      60 | +        '-g',
      61 | +        '--generate',
      62 | +        action='store_true',
      63 | +        help='Create new corpus seeds by running the given targets by batches '
    


    MarcoFalke commented at 7:04 AM on August 5, 2020:
            help='Create new corpus seeds (or extend the existing ones) by running the given targets by batches '
    

    This will also take into account existing seeds, right?


    darosior commented at 9:12 AM on August 5, 2020:

    Yep

  11. in test/fuzz/test_runner.py:170 in 749f6cb52c outdated
     166 | @@ -152,6 +167,35 @@ def main():
     167 |          )
     168 |  
     169 |  
     170 | +def generate_corpus_seeds(fuzz_pool, build_dir, seed_dir, targets, n):
    


    MarcoFalke commented at 7:06 AM on August 5, 2020:
    def generate_corpus_seeds(*, fuzz_pool, build_dir, seed_dir, targets, n):
    

    python is not type safe, so please use named arguments to not accidentally mix up the order of the args and to simplify review


    MarcoFalke commented at 7:08 AM on August 5, 2020:

    n is unused, no?


    darosior commented at 9:13 AM on August 5, 2020:

    Done


    darosior commented at 9:14 AM on August 5, 2020:

    Yes, forgot to remove it along with the ThreadPool. Thanks!

  12. in test/fuzz/test_runner.py:181 in 749f6cb52c outdated
     176 | +    logging.info(f"Generating corpus seeds to {seed_dir}")
     177 | +
     178 | +    def job(command):
     179 | +        logging.debug("Running {}..".format(" ".join(command)))
     180 | +        logging.debug("Output: {}\n\n".format(
     181 | +            subprocess.run(command, check=True, stderr=subprocess.PIPE).stderr
    


    MarcoFalke commented at 7:09 AM on August 5, 2020:

    is stderr bytes or an utf-8 string?


    darosior commented at 9:21 AM on August 5, 2020:

    Bytes :( Will pass encoding to run().

  13. darosior force-pushed on Aug 5, 2020
  14. darosior commented at 9:29 AM on August 5, 2020: member

    After locally fixing the log statement, I was able to run this and get a starting corpus pretty quickly. Why wouldn't you just use the populated corpus from qa-assets?

    This is used to actually populate and extend the seed corpus, which we use to test against.

  15. in test/fuzz/test_runner.py:184 in 191a161e85 outdated
     179 | +
     180 | +    def job(command):
     181 | +        logging.debug("Running {}..".format(" ".join(command)))
     182 | +        logging.debug("Output: {}\n\n".format(
     183 | +            subprocess.run(command, check=True, stderr=subprocess.PIPE,
     184 | +                           text=True).stderr
    


    MarcoFalke commented at 9:36 AM on August 5, 2020:
                               universal_newlines=True).stderr
    

    otherwise it would be python3.7 only

  16. in test/fuzz/test_runner.py:182 in 191a161e85 outdated
     177 | +    """
     178 | +    logging.info("Generating corpus seeds to {}".format(seed_dir))
     179 | +
     180 | +    def job(command):
     181 | +        logging.debug("Running {}..".format(" ".join(command)))
     182 | +        logging.debug("Output: {}\n\n".format(
    


    MarcoFalke commented at 9:37 AM on August 5, 2020:

    Would probably be good to print the command again, otherwise it can't be correlated


    darosior commented at 10:04 AM on August 5, 2020:

    Done

  17. MarcoFalke approved
  18. MarcoFalke commented at 9:38 AM on August 5, 2020: member

    re-ACK. Might test soon

  19. darosior force-pushed on Aug 5, 2020
  20. Crypt-iQ commented at 10:40 AM on August 5, 2020: contributor

    After locally fixing the log statement, I was able to run this and get a starting corpus pretty quickly. Why wouldn't you just use the populated corpus from qa-assets?

    This is used to actually populate and extend the seed corpus, which we use to test against.

    Is there ever a scenario in which you want to run this from scratch? I would imagine you could always just use your existing coverage instead of running from scratch unless I'm thinking about this all wrong.

  21. darosior commented at 12:51 PM on August 5, 2020: member

    Is there ever a scenario in which you want to run this from scratch?

    Generating from scratch, yes to create the initial corpus to test against. Generating not from scratch, yes to increase coverage to test against.

  22. Crypt-iQ commented at 1:09 PM on August 5, 2020: contributor

    Is there ever a scenario in which you want to run this from scratch?

    Generating from scratch, yes to create the initial corpus to test against. Generating not from scratch, yes to increase coverage to test against.

    I understand the second point, and I could see in some ways the first could be useful so Concept ACK from me. It's really for my own edification. Since some of these harnesses require very specific, structured input (see process_messages), the coverage won't be extensive if generated from scratch. Is the idea to eventually increase the coverage that can be generated from scratch via structured mutators and such?

  23. darosior force-pushed on Aug 8, 2020
  24. Crypt-iQ commented at 12:37 PM on August 8, 2020: contributor

    A bit unrelated to this PR, but thought I'd ask since it's kind of confusing me. When trying to run test/fuzz/test_runner.py with:

    ASAN_OPTIONS="detect_stack_use_after_return=1:check_initialization_order=1:strict_init_order=1" test/fuzz/test_runner.py seed_dir=~/qa-assets/fuzz_seed_corpus --loglevel DEBUG
    

    The fuzzing harness is only run a very small number of times which is certainly not enough to run every corpus input against the harnesses. Example output:

    Run pub_key_deserialize with args ['/root/bitcoin/src/test/fuzz/pub_key_deserialize', '-runs=1', 'seed_dir=/root/qa-assets/fuzz_seed_corpus/pub_key_deserialize']INFO: Seed: 3809663835
    INFO: Loaded 1 modules   (504 inline 8-bit counters): 504 [0x55ed26ff4ba8, 0x55ed26ff4da0),
    INFO: Loaded 1 PC tables (504 PCs): 504 [0x55ed26ff4da0,0x55ed26ff6d20),
    INFO:        3 files found in seed_dir=/root/qa-assets/fuzz_seed_corpus/pub_key_deserialize
    INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
    INFO: seed corpus: files: 3 min: 2b max: 4b total: 9b rss: 28Mb
    [#4](/bitcoin-bitcoin/4/)      INITED cov: 22 ft: 27 corp: 3/9b exec/s: 0 rss: 29Mb
    [#4](/bitcoin-bitcoin/4/)      DONE   cov: 22 ft: 27 corp: 3/9b lim: 4 exec/s: 0 rss: 29Mb
    Done 4 runs in 0 second(s)
     
    Run process_messages with args ['/root/bitcoin/src/test/fuzz/process_messages', '-runs=1', 'seed_dir=/root/qa-assets/fuzz_seed_corpus/process_messages']INFO: Seed: 3809693158
    INFO: Loaded 1 modules   (177287 inline 8-bit counters): 177287 [0x55751462da08, 0x557514658e8f),
    INFO: Loaded 1 PC tables (177287 PCs): 177287 [0x557514658e90,0x55751490d700),
    INFO:        8 files found in seed_dir=/root/qa-assets/fuzz_seed_corpus/process_messages
    INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
    INFO: seed corpus: files: 8 min: 1b max: 4b total: 18b rss: 120Mb
    [#9](/bitcoin-bitcoin/9/)      INITED cov: 611 ft: 1382 corp: 6/12b exec/s: 0 rss: 130Mb
    [#9](/bitcoin-bitcoin/9/)      DONE   cov: 611 ft: 1382 corp: 6/12b lim: 4 exec/s: 0 rss: 130Mb
    Done 9 runs in 0 second(s)
    

    An argument could be made for -runs=1 being raised as running a couple inputs against a fuzzing harness isn't too effective I feel like. Could raise it a little to get some decent amount of fuzzing coverage every time this python script is run. It can't be raised too much as I know some fuzzing harnesses (process_messages) take ~10 hours to go through every single input file. Thoughts?

    EDIT: It's just a libFuzzer option and when -runs=1, it doesn't log all the output.

  25. in test/fuzz/test_runner.py:64 in 395e7d38db outdated
      55 | @@ -56,6 +56,14 @@ def main():
      56 |          '--m_dir',
      57 |          help='Merge inputs from this directory into the seed_dir. Needs /target subdirectory.',
      58 |      )
      59 | +    parser.add_argument(
      60 | +        '-g',
      61 | +        '--generate',
      62 | +        action='store_true',
      63 | +        help='Create new corpus seeds (or extend the existing ones) by running'
      64 | +             ' the given targets for a finite amount of time. Outputs them to '
    


    Crypt-iQ commented at 4:25 PM on August 9, 2020:

    These no longer run for a finite amount of time, maybe finite amount of runs instead?



    Crypt-iQ commented at 11:57 AM on August 10, 2020:

    Oh sorry, I meant to modify the comment


    darosior commented at 12:01 PM on August 10, 2020:

    Oh yeah, you right: I misunderstood :)


    darosior commented at 12:03 PM on August 10, 2020:

    Amended to outline that i meant time not as in timestamp but as in "1 time, 2 times":

    diff --git a/test/fuzz/test_runner.py b/test/fuzz/test_runner.py
    index cfe2e67b5..5ad1c9b9f 100755
    --- a/test/fuzz/test_runner.py
    +++ b/test/fuzz/test_runner.py
    @@ -61,8 +61,8 @@ def main():
             '--generate',
             action='store_true',
             help='Create new corpus seeds (or extend the existing ones) by running'
    -             ' the given targets for a finite amount of time. Outputs them to '
    -             'the passed seed_dir.'
    +             ' the given targets for a finite number of times. Outputs them to'
    +             ' the passed seed_dir.'
         )
     
         args = parser.parse_args()
    
  26. Crypt-iQ approved
  27. Crypt-iQ commented at 6:29 PM on August 9, 2020: contributor

    Tested ACK 395e7d3

    Running with --generate and no seed directory onprocess_messages:

    ['/root/bitcoin/src/test/fuzz/process_messages', '-runs=100000', '/root/seed-run/process_messages'] output: INFO: Seed: 4176416250
    INFO: Loaded 1 modules   (177403 inline 8-bit counters): 177403 [0x561f7c684dc8, 0x561f7c6b02c3),
    INFO: Loaded 1 PC tables (177403 PCs): 177403 [0x561f7c6b02c8,0x561f7c965278),
    INFO:        0 files found in /root/seed-run/process_messages
    INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
    INFO: A corpus is not provided, starting from an empty corpus
    [#2](/bitcoin-bitcoin/2/)      INITED cov: 591 ft: 592 corp: 1/1b exec/s: 0 rss: 76Mb
    [#3](/bitcoin-bitcoin/3/)      NEW    cov: 594 ft: 912 corp: 2/2b lim: 4 exec/s: 0 rss: 77Mb L: 1/1 MS: 1 ChangeByte-
    [#4](/bitcoin-bitcoin/4/)      NEW    cov: 594 ft: 1346 corp: 3/3b lim: 4 exec/s: 0 rss: 77Mb L: 1/1 MS: 1 ChangeByte-
    [#5](/bitcoin-bitcoin/5/)      NEW    cov: 594 ft: 1347 corp: 4/4b lim: 4 exec/s: 0 rss: 77Mb L: 1/1 MS: 1 ChangeBit-
    [#6](/bitcoin-bitcoin/6/)      NEW    cov: 597 ft: 1350 corp: 5/6b lim: 4 exec/s: 0 rss: 77Mb L: 2/2 MS: 1 CopyPart-
    ...
    [#95703](/bitcoin-bitcoin/95703/)  REDUCE cov: 889 ft: 3018 corp: 233/14597b lim: 198 exec/s: 2990 rss: 80Mb L: 142/198 MS: 1 EraseBytes-
    [#95823](/bitcoin-bitcoin/95823/)  REDUCE cov: 889 ft: 3018 corp: 233/14573b lim: 198 exec/s: 2994 rss: 80Mb L: 133/198 MS: 5 InsertRepeatedBytes-InsertByte-CopyPart-ChangeBit-EraseBytes-
    [#95980](/bitcoin-bitcoin/95980/)  REDUCE cov: 889 ft: 3018 corp: 233/14568b lim: 198 exec/s: 2999 rss: 80Mb L: 143/198 MS: 2 ChangeBinInt-EraseBytes-
    [#96372](/bitcoin-bitcoin/96372/)  REDUCE cov: 889 ft: 3018 corp: 233/14545b lim: 198 exec/s: 3011 rss: 80Mb L: 98/198 MS: 2 ChangeBinInt-EraseBytes-
    [#96999](/bitcoin-bitcoin/96999/)  REDUCE cov: 889 ft: 3018 corp: 233/14522b lim: 198 exec/s: 3031 rss: 80Mb L: 158/198 MS: 2 InsertByte-EraseBytes-
    [#97228](/bitcoin-bitcoin/97228/)  REDUCE cov: 889 ft: 3018 corp: 233/14519b lim: 198 exec/s: 3038 rss: 80Mb L: 51/198 MS: 4 CMP-ShuffleBytes-EraseBytes-ShuffleBytes- DE: "\xff\xff\xff\xff"-
    [#97241](/bitcoin-bitcoin/97241/)  REDUCE cov: 889 ft: 3018 corp: 233/14507b lim: 198 exec/s: 3038 rss: 80Mb L: 131/198 MS: 3 ChangeBinInt-InsertRepeatedBytes-EraseBytes-
    [#97387](/bitcoin-bitcoin/97387/)  REDUCE cov: 889 ft: 3018 corp: 233/14485b lim: 198 exec/s: 3043 rss: 80Mb L: 146/198 MS: 1 EraseBytes-
    [#98259](/bitcoin-bitcoin/98259/)  NEW    cov: 889 ft: 3019 corp: 234/14655b lim: 205 exec/s: 3070 rss: 80Mb L: 170/198 MS: 2 ChangeBit-InsertRepeatedBytes-
    [#99240](/bitcoin-bitcoin/99240/)  REDUCE cov: 889 ft: 3019 corp: 234/14652b lim: 212 exec/s: 3007 rss: 80Mb L: 31/198 MS: 1 EraseBytes-
    [#99898](/bitcoin-bitcoin/99898/)  REDUCE cov: 889 ft: 3019 corp: 234/14650b lim: 212 exec/s: 3027 rss: 80Mb L: 87/198 MS: 3 ShuffleBytes-ChangeBit-EraseBytes-
    [#99917](/bitcoin-bitcoin/99917/)  NEW    cov: 889 ft: 3020 corp: 235/14861b lim: 212 exec/s: 3027 rss: 80Mb L: 211/211 MS: 4 InsertByte-InsertRepeatedBytes-InsertRepeatedBytes-CopyPart-
    [#100000](/bitcoin-bitcoin/100000/) DONE   cov: 889 ft: 3020 corp: 235/14861b lim: 212 exec/s: 3030 rss: 80Mb
    

    Since it's super structured data, coverage isn't that high (running with qa-assets corpus and some local seeds, I get around cov: 4700). I think anything that has low coverage with no starting seed directory could be a target for structured fuzzing?

  28. darosior force-pushed on Aug 10, 2020
  29. darosior force-pushed on Aug 11, 2020
  30. darosior force-pushed on Aug 13, 2020
  31. practicalswift commented at 3:49 PM on August 14, 2020: contributor

    Tested ACK 0b83151f1042f9d3978ad07a573e89847733b456

    Thanks for improving test/fuzz/test_runner.py -- nice stuff! :)

  32. in test/fuzz/test_runner.py:182 in 0b83151f10 outdated
     177 | +    """
     178 | +    logging.info("Generating corpus seeds to {}".format(seed_dir))
     179 | +
     180 | +    def job(command):
     181 | +        logging.debug("Running {}..".format(" ".join(command)))
     182 | +        logging.debug("{} output: {}\n\n".format(
    


    MarcoFalke commented at 9:23 AM on August 22, 2020:

    Seems odd to append two dots to the command. Could at least add ' around the command?

    diff --git a/test/fuzz/test_runner.py b/test/fuzz/test_runner.py
    index 5ad1c9b9fa..f2a8aefd05 100755
    --- a/test/fuzz/test_runner.py
    +++ b/test/fuzz/test_runner.py
    @@ -178,9 +178,9 @@ def generate_corpus_seeds(*, fuzz_pool, build_dir, seed_dir, targets):
         logging.info("Generating corpus seeds to {}".format(seed_dir))
     
         def job(command):
    -        logging.debug("Running {}..".format(" ".join(command)))
    -        logging.debug("{} output: {}\n\n".format(
    -            command,
    +        logging.debug("Running '{}' ...\n".format(" ".join(command)))
    +        logging.debug("Command '{}' output:\n{}\n".format(
    +            ' '.join(command),
                 subprocess.run(command, check=True, stderr=subprocess.PIPE,
                                universal_newlines=True).stderr
             ))
    

    MarcoFalke commented at 9:24 AM on August 22, 2020:

    Also a newline between the two logs would help, as they are unrelated in the overall log once more than one thread is running (see above suggested diff)


    darosior commented at 4:01 PM on August 23, 2020:

    Added

  33. in test/fuzz/test_runner.py:200 in 0b83151f10 outdated
     195 | +            target_seed_dir,
     196 | +        ]
     197 | +        futures.append(fuzz_pool.submit(job, command))
     198 | +
     199 | +    for future in as_completed(futures):
     200 | +        continue
    


    MarcoFalke commented at 9:38 AM on August 22, 2020:

    Any reason to ignore failures?

            future.result()
    

    darosior commented at 3:59 PM on August 23, 2020:

    Nope, corrected.

  34. MarcoFalke commented at 9:38 AM on August 22, 2020: member

    Some questions

    ac9927d517c833eb01e02ada097d095559f1f435

  35. darosior force-pushed on Aug 23, 2020
  36. darosior force-pushed on Aug 23, 2020
  37. test/fuzz: add a seed corpus generation option to the test_runner
    Signed-off-by: Antoine Poinsot <darosior@protonmail.com>
    15ae4a17c4
  38. darosior force-pushed on Aug 23, 2020
  39. MarcoFalke commented at 6:02 AM on August 24, 2020: member

    ACK 15ae4a17c4

  40. MarcoFalke merged this on Aug 24, 2020
  41. MarcoFalke closed this on Aug 24, 2020

  42. jasonbcox referenced this in commit 10f2337d21 on Nov 6, 2020
  43. darosior deleted the branch on Jan 12, 2021
  44. DrahtBot locked this on Aug 16, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-24 21:14 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me