[PoC] Structure aware fuzzing with libprotobuf-mutator #26975

dergoegge commented at 4:57 PM on January 26, 2023: member

We currently use FuzzedDataProvider and a suite of Consume* functions for targets that require input formats other than a byte array. This approach is good for a lot of targets but has issues when it comes to more complex input formats.

The input corpora consist of custom input serialization formats, which means that the inputs have no meaning outside of the target itself. Seeding or sharing inputs is basically impossible when dealing with custom formats per target, however mutation based fuzzers are particularly effective when provided with an initial seed corpus (coverage guided fuzzers like libFuzzer are able to start from an empty corpus but that is less effective).
The fuzzer is not able to make useful mutations efficiently, because it only deals with raw bytes and is not aware of the input format. Fuzzers will still be able to create useful mutations, however only after many iterations.
Changing the target often leads to invalidation of the existing input corpus. For example, if the target is modified to interpret the input data in a more useful way, then the previous input corpus is invalidated, as the serialization format is modified.

libFuzzer provides an interface for dealing with structured input formats: LLVMFuzzerCustomMutator and LLVMFuzzerCustomCrossOver. Using this interface it is possible to curate input corpora with highly structured input formats (e.g. png files, json, encrypted, compressed, base64 encoded). This is described here in detail.

libprotobuf-mutator is a library for mutating protocol buffers, that also provides an interface around libFuzzer's custom mutator API. It allows us to specify input grammars using protobufs and exclusively provides useful mutations (i.e. mutations of the specified input format).

Using libprofobuf-mutator can address most of the issues of the ´FuzzedDataProvider` approach.

Input corpora exclusively consist of valid protobuf serializations. Meaning that seeding of corpora becomes quite easy, as all you need to do is provide your initial test cases in the protobuf format (i.e. have a script that produces useful initial test cases, similar to feature_taproot --dumptests except that it should spit out protobufs instead of json objects). Sharing inputs between targets becomes much easier (e.g. if two targets make use of transactions as inputs, then copying the transactions from one targets corpus to the other can easily be automated).
By default the protobufs are serialized into a human readable format, which makes debugging of crashes easier and also enables hand-rolling (initial) test cases.
IMO, writing protobuf definitions to define input grammars is very easy and maintainable. Looking at the protobuf definition gives an immediate overview of the input type a target takes (vs having to understand the combination FuzzedDataProvider and Consume* calls).
Modifying the target is possible without invalidating the existing inputs.
(We could likely get rid of quite a few of our Consume* functions meaning that there is less test only code to maintain.)

I have provided three examples in this PR that make use of libprotobuf-mutator.

Fuzzing mempool acceptance
Fuzzing the version handshake
Fuzzing validation (ProcessNewBlock, ProcessNewBlockHeaders, ProcessTransaction)

Building this PR

First clone and build libprotobuf-mutator, instruction can be found in their readme.

Then compile the protobuf definitions in this PR to c++:

cd src/test/fuzz/proto/
protoc *.proto --cpp_out .

Next configure and build the proto fuzzer binaries:

./configure --enable-fuzz --enable-proto-fuzz --with-sanitizers=fuzzer && make

If you did not install the libprotobuf-mutator libraries and headers onto your system, then you might have to set LDFLAGS and CPPFLAGS to point to your local LPM build.

If you manage to build and run the fuzzers, you can inspect the generated inputs with cat or any editor of your choosing.

Looking for conceptual review

DrahtBot commented at 4:57 PM on January 26, 2023: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type	Reviewers
Concept ACK	chinggg

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

maflcko commented at 6:43 PM on January 26, 2023: member

Nice.

The fuzz framework was designed to be agnostic of the underlying fuzz engine. Does libprotobuf require libfuzzer?
How does this integrate into the fuzzing farms (Oss-Fuzz, and the ones based on fuzz/test_runner.py --generate)? we'd likely have to duplicate all fuzzing infra (qa-assets, coverage, CPU farms, ...)?
How does this deal with fuzzing input that are impossible to represent in protbuf? For example I could imagine a target that uses CallOneOf and conditionally deserializes inside one of its branches to not be representable, at least trivially?
How does this deal where the target code and the proto format are mismatching?
What you expect from devs: Should they write a proto fuzz test or a "normal" one, or both?
Do you have real world data on two identical targets, where the protbuf one finds a bug earlier?

dergoegge commented at 2:15 PM on January 27, 2023: member

Thanks for linking the issues, i have not seen those before.

The fuzz framework was designed to be agnostic of the underlying fuzz engine. Does libprotobuf require libfuzzer?

Not necessarily, I think LPM should be compatible with any fuzzing engine that has support for something similar to libFuzzer's custom mutator API (e.g. afl++: https://aflplus.plus/docs/custom_mutators/, they even mention LPM in these docs). Writing a wrapper to make that work should be easy enough. Out of the box however, there is only support for libFuzzer.

How does this integrate into the fuzzing farms (Oss-Fuzz, and the ones based on fuzz/test_runner.py --generate)?

I think technically, there should be no reason why this shouldn't just work(tm) on any of the fuzzing farms. As long as libFuzzer is used by the farm, it doesn't need to know about anything protobuf/LPM specific. There are multiple projects on oss-fuzz that make use of LPM (e.g. llvm, envoy). So integration should be quite straight forward, like our other fuzz targets.

we'd likely have to duplicate all fuzzing infra (qa-assets, coverage, CPU farms, ...)?

I don't think so, the corpora could still live in qa-assets and creating coverage reports should still be the same. If the proto fuzzers end up being in separate binaries then all we need to do is tell our fuzzing infra about those binaries (that's at least my understanding, you know more about our infra). I am certainly happy to help make this work with our current infra, if we decide to go this route.

How does this deal with fuzzing input that are impossible to represent in protbuf? For example I could imagine a target that uses CallOneOf and conditionally deserializes inside one of its branches to not be representable, at least trivially?

The proto fuzzers are intended to only receive well formed protobufs as inputs (afaict it ignores any inputs that don't match the input grammar) and the custom mutator makes sure of that. You wouldn't use CallOneOf the way you are describing (at least if I am understanding what you mean correctly). You can use FuzzedDataProvider with in a proto fuzz target by having raw bytes in the grammar definition (see the version handshake example, it still uses the fuzzed sock).

How does this deal where the target code and the proto format are mismatching?

I don't understand the question, could you give an example?

What you expect from devs: Should they write a proto fuzz test or a "normal" one, or both?

Don't have a solid framework on this yet but some thoughts.

It can definitely make sense to do both in some cases. A non-bitcoin example would be fuzzing a json parser, just taking the raw bytes from the fuzzer and interpreting that as a json string has value (i.e malformed garbage shouldn't cause any bugs), however that rarely produces valid (complex) json objects which is where LPM could come in (creating a grammar for json as a protobuf definition is quite easy see: https://source.chromium.org/chromium/chromium/src/+/main:testing/libfuzzer/proto/json.proto).
For simple targets where non-complex and few inputs are required, input splitting with FuzzedDataProvider is probably preferable, however if we have other targets that consume the same/similar inputs it could make sense to use LPM instead (to make sharing of inputs easier).
Any target that requires complex inputs (e.g. blocks, transactions, sequences of p2p messages, etc.) is likely to be better suited for LPM. That being said, I do think that targets like process_message(s) still make sense even if we have LPM targets for the same code, as they are really simple.
Any target where we might want to evolve the input grammar could be a good usecase for LPM, so that previous inputs aren't invalidated.

Do you have real world data on two identical targets, where the protbuf one finds a bug earlier?

Not yet, gonna try to code up an example.

maflcko commented at 2:36 PM on January 27, 2023: member

I don't understand the question, could you give an example?

Assuming the proto format is message FuzzOptions { optional int64 max_size_bytes = 1; }, but the code queries for optional max_bytes. I hope this will result in a compile failure and not a run-time std::nullopt?

maflcko commented at 2:41 PM on January 27, 2023: member

Also, I don't see how this will make it easier to share inputs without more glue code.

Assume one target is message _ {required bool _; required Tx _;} and another target doesn't have the bool, so it seems the two targets can't share the inputs, unless by accident or if some translator is written?

dergoegge commented at 4:45 PM on January 27, 2023: member

Assuming the proto format is message FuzzOptions { optional int64 max_size_bytes = 1; }, but the code queries for optional max_bytes. I hope this will result in a compile failure and not a run-time std::nullopt?

Yes this would cause a compile failure. The proto definitions are compiled to c++, so you can only actually use what's defined in your proto.

Also, I don't see how this will make it easier to share inputs without more glue code.

Assume one target is message _ {required bool _; required Tx _;} and another target doesn't have the bool, so it seems the two targets can't share the inputs, unless by accident or if some translator is written?

Yes for sharing we need some sort of glue code, but I think that could be a very generic script that takes two arbitrary proto definitions A and B and their corpora, and then copies values of compatible types from A's corpus into B's. LPM does something similar in its custom cross over function.

in src/test/fuzz/proto/atmp.proto:30 in b85bb8a30b outdated

  25 | +message TxOutput {
  26 | +  required int64 value = 1;
  27 | +  required Script script_pub_key = 2;
  28 | +}
  29 | +
  30 | +message Transaction {

maflcko commented at 4:49 PM on January 27, 2023:

nit: Is it possible to de-duplicate proto sections (for example common stuff like a transaction)?

dergoegge commented at 4:51 PM on January 27, 2023:

Yes, afaik you we could have a sort of library of proto definitions that can be reused.

dergoegge commented at 4:21 PM on February 2, 2023:

I have de-duplicated the common definitions in the latest push.

in src/Makefile.test.include:24 in b85bb8a30b outdated

  15 | @@ -13,6 +16,9 @@ endif
  16 |  TEST_SRCDIR = test
  17 |  TEST_BINARY=test/test_bitcoin$(EXEEXT)
  18 |  FUZZ_BINARY=test/fuzz/fuzz$(EXEEXT)
  19 | +PROTO_FUZZ_ATMP_BINARY=test/fuzz/proto/atmp$(EXEEXT)
  20 | +PROTO_FUZZ_VERSION_BINARY=test/fuzz/proto/version$(EXEEXT)
  21 | +PROTO_FUZZ_VALIDATION_BINARY=test/fuzz/proto/validation$(EXEEXT)

maflcko commented at 4:49 PM on January 27, 2023:

nit: Would be nice to only have one binary

maflcko approved

chinggg commented at 3:00 PM on February 1, 2023: contributor

Concept ACK

It's great to see a draft for structure-aware fuzzing. When I made fuzz targets for transaction handling, I heavily used APIs in FuzzDataProvider to create transactions with random inputs/outputs. I think it will be easier to create high-level objects with the help of libprotobuf-mutator.

I have a question about the generation of .proto definitions. It seems that these definitions should exactly represent corresponding C++ class definitions. I don't know if there are robust ways to convert C++ class definition to protobuf definitions. If the .proto files are manually written, developers should pay attention to ensure their correctness and remember to maintain them when updating C++ code. Nevertheless, I think the protocol of Bitcoin may not change frequently so proto definitions of fundamental concepts like transaction don't require change.

dergoegge commented at 12:10 PM on February 2, 2023: member

It seems that these definitions should exactly represent corresponding C++ class definitions. I don't know if there are robust ways to convert C++ class definition to protobuf definitions. If the .proto files are manually written, developers should pay attention to ensure their correctness and remember to maintain them when updating C++ code.

They do not need to represent the exact c++ classes, although that can make sense when you require the c++ types as input. For example, the Script message type in the proto definitions in this PR could be more expressive than a byte array. Even though the CScript c++ type is effectively just a byte array, it would be possible to express a grammar for Script (the language) (or just a subset thereof) as a proto.

Usually these targets will have one or more conversion functions from protobuf to c++ types. Should the c++ types change, then the compiler would likely fail on the conversion functions, reminding the devs to update/change the proto definitions.

[build] Introduce --enable-proto-fuzz, check for LPM libs 4d0ff07fc7

[proto fuzz] atmp example 5abb6c0d2d

[proto fuzz] p2p version handshake example 1e655f5222

[proto fuzz] validation example 90504f005d

dergoegge force-pushed on Feb 2, 2023

dergoegge commented at 4:21 PM on February 2, 2023: member

Added some rough build instructions to the PR description, LPM is no longer included as a subtree.

dergoegge closed this on May 30, 2023

maflcko commented at 4:32 PM on May 30, 2023: member

Did you find that this was useless in a benchmark, or did you just close for fun :smiling_face_with_tear: ?

dergoegge commented at 11:05 AM on July 17, 2023: member

Did you find that this was useless in a benchmark, or did you just close for fun?

Just wasn't my top priority but I plan on revisiting this. I've also been working on a library that does the same as libprotobuf-mutator but for Cap'n Proto, which might be more appropriate if we go the multi-process route (could also be used to fuzz the multi-process interfaces).

bitcoin locked this on Jul 16, 2024