fuzz, brainstorm: Individual binaries per harness

dergoegge commented at 4:43 pm on November 29, 2023: member

I would like to propose splitting up our fuzz binary into one binary per fuzz harness (or at least have an option to build separate binaries). This would primarily enable properly compiling with LTO, which would have several benefits:

In environments where individual binaries are required/desired (e.g. oss-fuzz), this has the benefit that each binary is as small as possible, resulting in less disk usage (nice for container images).
Enable afl++ to create a token dictionary at compile time, which only contains tokens that are found in code paths reachable by each individual harness.
Efficient and collision edge coverage tracking and more for afl++ see: https://github.com/AFLplusplus/AFLplusplus/blob/stable/instrumentation/README.lto.md.
People designing and developing fuzzing engines recommend not linking unreachable code into fuzz binaries: https://github.com/google/fuzzing/blob/master/docs/good-fuzz-target.md#unreachable-code.

The only downside would be that linking multiple binaries is slower (this was the only reason for switching to compiling only one binary), but I think we can work around this by simply making this optional.

To achieve this we would need to:

Change the build system to have an option to compile individual binaries
Change the fuzzing framework to (optionally) have FUZZ_TARGET include the actual fuzz entry point directly (e.g. LLVMFuzzerTestOneInput) instead of accumulating all harness functions into a global map
- This probably requires splitting each harness into its own file

dergoegge commented at 4:44 pm on November 29, 2023: member

@maflcko what do you think?

sipa commented at 4:46 pm on November 29, 2023: member

We used to do that :sweat_smile:, and it was a huge pain because the linking time for each of the binaries…

maflcko commented at 4:53 pm on November 29, 2023: member

In environments where individual binaries are required/desired (e.g. oss-fuzz [1]), this has the benefit that each binary is as small as possible, resulting in less disk usage (nice for container images).

I don’t think this works, because the other fuzz targets presumably only consume a minimal amount of binary space. Most of the space is consumed by all the core libraries (libtest_util, libutil, libkernel, libcommon, …)?

Enable afl++ to create a token dictionary at compile time, which only contains tokens that are found in code paths reachable by each individual harness.

I don’t think this works, due to the runtime registration indirection via the function pointer in a std::map. I think this is also a blocker for fuzz-introspector, etc.

Change the build system to have an option to compile individual binaries

The option already exists and OSS-Fuzz is using the option.

Obviously, pull requests with improvements are welcome to remove the std::map, but I don’t think reducing the size will be possible, unless you completely rework the internal library design of Bitcoin Core?

maflcko added the label Brainstorming on Nov 29, 2023

maflcko added the label Tests on Nov 29, 2023

dergoegge commented at 5:15 pm on November 29, 2023: member

I don’t think this works, because the other fuzz targets only consume a minimal amount of binary space. Most of the space is consumed by all the core libraries (libtest_util, libutil, libkernel, libcommon, …)

I tested this in a hacky way by commenting out all files except bitdeque.cpp from test_fuzz_fuzz_SOURCES in Makefile.test.include and then compiling with afl-clang-lto. The resulting binary is significantly smaller.

binary size: 23.5MB vs 112.6MB
auto dict size: 66 vs. 448 tokens
afl++ coverage map size: 13321 vs 264945 bytes

I don’t think this works, due to the runtime registration indirection via the function pointer in a std::map. I think this is also a blocker for fuzz-introspector, etc.

I suggested in the issue description to get rid of the map: Change the fuzzing framework to (optionally) have FUZZ_TARGET include the actual fuzz entry point directly (e.g. LLVMFuzzerTestOneInput) instead of accumulating all harness functions into a global map

The option already exists and OSS-Fuzz is using the option.

What option? We very obviously hack around not having such a option in oss-fuzz: https://github.com/google/oss-fuzz/blob/3bcef37498cc10279e14a639b03170a49929fc3c/projects/bitcoin-core/build.sh#L69-L82

unless you completely rework the internal library design of Bitcoin Core?

Does LTO not strip unused library components from the resulting binary?

dergoegge commented at 5:19 pm on November 29, 2023: member

We used to do that 😅, and it was a huge pain because the linking time for each of the binaries…

I have read through the past PRs (https://github.com/bitcoin/bitcoin/pull/20560, #15043) and I think if the link time is still an issue we can just make this optional and leave building one big binary the default.

maflcko commented at 5:25 pm on November 29, 2023: member

I tested this in a hacky way by commenting out all files except bitdeque.cpp from test_fuzz_fuzz_SOURCES in Makefile.test.include and then compiling with afl-clang-lto. The resulting binary is significantly smaller.

Nice. Thanks for testing. I wonder if you can get the same small binary size if you still compile the other fuzz targets. If LTO can strip them, that seems preferable and should make it easier to use this option.

What option? We very obviously hack around not having such a option in oss-fuzz

Yeah, that is using a binary hack, but previously it was using a patch on top of the source code, including a full re-compilation, which I think is similar to what you are asking for here? I guess you want the patch to be maintained inside this repo and not outside?

dergoegge commented at 4:44 pm on December 4, 2023: member

including a full re-compilation, which I think is similar to what you are asking for here? I guess you want the patch to be maintained inside this repo and not outside?

Yes but the patch I’m proposing is quite different, as it completely removes the map because I don’t think simply search and replacing std::getenv("FUZZ") let’s LTO prune the other harnesses’ code.

I’ll code up a PR for this so it’s more clear what I have in mind.

fuzz, brainstorm: Individual binaries per harness #28971