fuzz: compact block harness #33300

Crypt-iQ wants to merge 7 commits into bitcoin:master from Crypt-iQ:cmpctblock-fuzz-0807-fs, changing 11 files (+492 −18)
  1. Crypt-iQ commented at 10:33 pm on September 3, 2025: contributor

    Posting this to get feedback; there are some design flaws with the approach in this PR. Coverage is here (look in src/blockencodings.cpp and the relevant compact block bits in src/net_processing.cpp).

    This harness can make (in)valid blocks, reconstruct blocks with in-mempool txns, mark peers as high-bandwidth (HB), and has high stability in AFL++ (~98-99%).

    The main downside is that there are filesystem operations. In the .init function initialize_cmpctblock, a chain of 200 blocks is created. Each fuzzing iteration then copies this statically-named, “cached” data directory to a temporary directory that gets deleted at the end of the iteration. If each fuzzing iteration instead mines its own chain, the execs/s slows down to a crawl (~0.5/s or less, which would also make CI runs really slow).
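
    Roughly, each fuzz iteration does the following (an illustrative sketch only; the names and paths here are not the ones the harness actually uses, and the real flow is driven via -fuzzcopydatadir inside BasicTestingSetup):

        // One fuzz iteration: copy the cached datadir into a fresh temporary
        // directory, run against it, then throw the copy away.
        const fs::path cached_dir{fs::temp_directory_path() / "cmpctblock_cache"}; // created once in init
        const fs::path iter_dir{fs::temp_directory_path() / "cmpctblock_iter"};    // this iteration's datadir
        TryCreateDirectories(iter_dir);
        fs::copy(cached_dir, iter_dir, fs::copy_options::recursive);
        // ... run the fuzz iteration against iter_dir ...
        fs::remove_all(iter_dir); // deleted at the end of the iteration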

  2. DrahtBot added the label Tests on Sep 3, 2025
  3. DrahtBot commented at 10:33 pm on September 3, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33300.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #33191 (net: Provide block templates to peers on request by ajtowns)
    • #29415 (Broadcast own transactions only via short-lived Tor or I2P connections by vasild)
    • #26812 (test: add end-to-end tests for CConnman and PeerManager by vasild)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  4. Crypt-iQ force-pushed on Sep 3, 2025
  5. Crypt-iQ force-pushed on Sep 3, 2025
  6. maflcko commented at 6:45 am on September 4, 2025: member

    There does not seem to be a way without signal handlers to delete the static, cached datadir when fuzzing is complete. Another issue is that multi-core fuzzing via AFL++ does not work because each worker will try to wipe the cached datadir (created via -testdatadir which wipes if the directory exists) which will cause other workers to crash if they are trying to copy it. I could not figure out a way for each worker to have their own cached datadir.

    Is my understanding correct that AFL will run init (create the static dir), then fork into several fuzz processes, which run for N iterations and then shut down and delete the static dir? (On the next re-fork, it will fail to find the static dir)?

    If yes, a solution could be to lazy-init the static dir after init, after the fork. Also, you may have to use the pid (or os-rand) to name the static folder to avoid collisions.
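
    For illustration, that could look roughly like this (a sketch only; getpid() and the path layout are assumptions, not actual harness code):

        // Create the static, cached datadir on first use (i.e. after AFL++ has
        // forked), mixing the pid into the name so workers do not collide.
        static const fs::path& CachedDataDir()
        {
            static const fs::path dir{[] {
                const fs::path p{fs::temp_directory_path() /
                                 ("cmpctblock_cache_" + std::to_string(getpid()))};
                TryCreateDirectories(p);
                // ... mine the cached chain into p ...
                return p;
            }()};
            return dir;
        }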

    As for deleting the dirs, maybe it is possible to spin up several testing setups at the same time:

    • One dummy testing setup to create a root dir, which is cleaned up
    • One testing setup living inside that root dir for the static folder (in a subfolder)
    • One testing setup living inside the root dir with a copy of the static folder

    (Not sure if this is possible, though)

  7. dergoegge commented at 8:59 am on September 4, 2025: member

    Is my understanding correct that AFL will run init (create the static dir), then fork into several fuzz processes, which run for N iterations and then shut down and delete the static dir? (On the next re-fork, it will fail to find the static dir)?

    We removed the __AFL_INIT call, so it will actually fork prior to init; I think all that’s needed is to also randomize the static dir name, as you suggested.

  8. Crypt-iQ commented at 1:58 pm on September 4, 2025: contributor

    Is my understanding correct that AFL will run init (create the static dir), then fork into several fuzz processes, which run for N iterations and then shut down and delete the static dir? (On the next re-fork, it will fail to find the static dir)?

    My description was a bit vague. Since the harness uses -testdatadir in init to create the static datadir, it does not get deleted in the TestingSetup destructor (by design of -testdatadir). Any time init is called, it will wipe the static datadir if it already exists. This can happen after N iterations (100,000 currently) or with a newly spawned worker after forking, since the fork point is prior to init as @dergoegge mentioned. If the datadir is wiped while a fuzz iteration is trying to copy it, the copy of the now non-existent directory will crash.

    a solution could be to lazy-init the static dir after init, after the fork

    I considered this, but I think this introduces some (slight) non-determinism as the directory may/may not exist?

    Also, you may have to use the pid (or os-rand) to name the static folder to avoid collisions.

    Yup, I agree. I think this randomization can only be done if the static datadir can be deleted at the end of fuzzing; otherwise there will be lots of these lying around?

    As for deleting the dirs, maybe it is possible to spin up several testing setups at the same time:

    Why is the dummy testing setup needed in this example? Could it instead just be two testing setups (one for the static datadir, one for the copied datadir)? I considered something similar as well, where there is a static TestingSetup that gets destructed at the end of fuzzing (and wipes the static datadir) and one per fuzz iteration (destructed at the end of the iteration), but I was worried that the static TestingSetup would introduce more non-determinism. For example, what if this static TestingSetup has scheduling going on or is using the static datadir while we try to copy from it?

  9. Crypt-iQ force-pushed on Sep 5, 2025
  10. Crypt-iQ commented at 9:41 pm on September 5, 2025: contributor
    I was able to get AFL++ multi-core fuzzing to work by adding a static FuzzedDirectoryWrapper instance that deletes the passed path in its destructor, as well as adding a random element to each statically named datadir. Each instance shows 98-99% stability and does not crash after hitting 100,000 iterations (yay). I was not able to use multiple TestingSetups at the same time without causing several panics due to assertions about globals. I will fix the lint and tidy jobs and also see if the deterministic-fuzz-coverage issues still pop up.
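
    A rough sketch of the RAII idea (the actual FuzzedDirectoryWrapper in the PR may differ in details):

        // Holds the path of the cached datadir and removes it when fuzzing ends,
        // i.e. when static destructors run.
        struct FuzzedDirectoryWrapper {
            fs::path m_dir;
            explicit FuzzedDirectoryWrapper(fs::path dir) : m_dir{std::move(dir)} {}
            ~FuzzedDirectoryWrapper()
            {
                std::error_code ec;
                fs::remove_all(m_dir, ec); // best-effort cleanup, ignore errors
            }
        };
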
  11. Crypt-iQ force-pushed on Sep 9, 2025
  12. Crypt-iQ commented at 4:39 pm on September 9, 2025: contributor

    There is non-determinism here because m_dirty_blockindex is a std::set<CBlockIndex*> and is therefore ordered by pointer value. This can be fixed by adding a function object that compares the block hashes. However, I did not know if this should be added just for fuzz code, and I have not run benchmarks.
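
    A minimal sketch of the kind of comparator meant here (the definition used in the PR may differ):

        // Order CBlockIndex pointers by their block hash instead of by their
        // address, so that iteration order is stable across runs.
        struct CBlockIndexBlockHashComparator {
            bool operator()(const CBlockIndex* a, const CBlockIndex* b) const
            {
                return a->GetBlockHash() < b->GetBlockHash();
            }
        };

        // m_dirty_blockindex would then become:
        // std::set<CBlockIndex*, CBlockIndexBlockHashComparator> m_dirty_blockindex;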

    After patching the above non-determinism locally, there is also non-determinism in the scheduler because this fuzz test uses the scheduler. I am guessing this is due to the RegisterValidationInterface call in the fuzz test, which notifies when transactions are added or removed and when blocks are mined.

  13. Crypt-iQ marked this as ready for review on Sep 9, 2025
  14. in src/test/fuzz/cmpctblock.cpp:40 in 0de9143aa0 outdated
    35+using namespace util::hex_literals;
    36+
    37+namespace {
    38+
    39+//! Fee each created tx will pay.
    40+const CAmount amount_fee = 1000;
    


    dergoegge commented at 3:35 pm on September 11, 2025:

    nit

    0const CAmount AMOUNT_FEE{1000};
    

    Crypt-iQ commented at 3:38 pm on September 18, 2025:
    Thanks, done.
  15. in src/test/fuzz/cmpctblock.cpp:359 in 0de9143aa0 outdated
    354+                    return;
    355+                }
    356+
    357+                // Choose an existing block and send a HEADERS message for it.
    358+                size_t index = fuzzed_data_provider.ConsumeIntegralInRange<size_t>(0, num_blocks - 1);
    359+                CBlock block = *info[index].block;
    


    dergoegge commented at 3:37 pm on September 11, 2025:
    0                CBlock block = *info[index].block;
    1                block.vtx.clear();
    

    The headers message is a vector of blocks without transactions, so I think you should clear the copy of the block here? Although, technically this doesn’t matter because there’s only one header in the vector being sent, so the following txs are just ignored.


    Crypt-iQ commented at 3:38 pm on September 18, 2025:
    Ah nice catch, I completely missed this. Done.
  16. in src/test/fuzz/cmpctblock.cpp:320 in 0de9143aa0 outdated
    315+                    // Remove from shorttxids since we've prefilled. Subtract however many txs have been prefilled.
    316+                    cmpctBlock.EraseShortTxIDs(i - 1 - num_erased);
    317+                    ++num_erased;
    318+                }
    319+
    320+                CBlockHeaderAndShortTxIDs baseCmpctBlock = cmpctBlock;
    


    dergoegge commented at 3:44 pm on September 11, 2025:
    0                CBlockHeaderAndShortTxIDs baseCmpctBlock = cmpctBlock;
    

    nit: there are a few instances of camel case usage here. Our convention is to use snake case (see dev notes).


    Crypt-iQ commented at 3:39 pm on September 18, 2025:
    Thanks, I believe I’ve fixed all of the camel case.
  17. dergoegge changes_requested
  18. dergoegge commented at 3:29 pm on September 12, 2025: member
    Did a first pass, overall approach looks good to me
  19. Crypt-iQ force-pushed on Sep 18, 2025
  20. Crypt-iQ force-pushed on Sep 18, 2025
  21. Crypt-iQ force-pushed on Sep 18, 2025
  22. Crypt-iQ force-pushed on Sep 18, 2025
  23. Crypt-iQ commented at 4:18 pm on September 18, 2025: contributor

    The latest push adds two commits:

    • a5619ea631bd8b93b4ef02a20abb8c1c0705d8e4 “test: add setup_validation_interface_no_scheduler to TestOpts”
      • Disables the scheduler completely if set. This is needed because this harness creates a TestingSetup inside FUZZ_TARGET and scheduling a promise can be non-deterministic as the scheduler’s serviceQueue may start with a non-empty taskQueue. This is not an issue in the other fuzz tests because their TestingSetup is created in .init and ResetCoverageCounters is called after.
    • e2f921458913bcbbe74115cdb2174b0ab31784f2 “node: sort m_dirty_blockindex by block hash”
      • I am ok with this commit being removed and want to know what others think. This sorts by block hash rather than memory address and does introduce a slow-down in production code for no production benefit (I think because memory addresses are generally increasing while inserting into m_dirty_blockindex, whereas block hashes are not, so insertions land all over the set). I added it to show the change needed to make this harness fully deterministic. It is also possible to add an #ifdef so that it doesn’t negatively affect production code, but I also don’t want to set a precedent and litter the codebase with fuzz-specific macros. The sorting is needed as otherwise leveldb’s MemTable will be non-deterministic across runs. I tried some alternatives, but this is the best I could come up with that made the deterministic-fuzz-coverage script happy.
  24. net: move CMPCTBLOCK_VERSION to header 79ea9578d6
  25. fuzz: move FinalizeHeader from p2p_headers_presync.cpp to util.h
    This allows other fuzz tests to use the function to create valid block headers.
    254e13cd44
  26. fs: add std::filesystem::copy wrapper
    Disable implicit string conversion.
    eb92ad759d
  27. test: add -fuzzcopydatadir, modify -testdatadir semantics during fuzzing
    The -fuzzcopydatadir argument accepts a path to a cached data directory.
    It then copies everything in this data directory over to a new directory
    that will be deleted in ~BasicTestingSetup.
    
    The -testdatadir argument is modified to not add a random path element
    if EnableFuzzDeterminism() is true. This is necessary so the path of
    the data directory created via -testdatadir can be passed into
    -fuzzcopydatadir. Since the caller, BasicTestingSetup(), is not aware of
    the random path element, finding it would be more complex than just
    not adding it.
    f95cfba890
  28. test: add setup_validation_interface_no_scheduler to TestOpts
    This option is mutually exclusive with setup_validation_interface
    and ensures that a scheduler is not created.
    797886aaa2
  29. fuzz: add harness for the compact blocks protocol
    This harness does one of the following operations in a loop:
    - send a cmpctblock message to the test node
    - send a blocktxn message to the test node
    - send a headers message to the test node
    - send a sendcmpct message to the test node
    - send a tx message to the test node
    - mine a block
    - set mock time
    
    The initialize function creates a TestingSetup and mines 200 blocks. Each fuzz
    iteration will then create a TestingSetup and copy the cached data directory via
    -fuzzcopydatadir.
    ba89cdd93e
  30. node: sort m_dirty_blockindex by block hash
    This avoids non-determinism since memory addresses change between
    runs for a single input.
    ed813c48f8
  31. in src/test/fuzz/cmpctblock.cpp:261 in a473ec6509 outdated
    256+            }
    257+        }
    258+
    259+        CBlockIndex* pindexPrev{WITH_LOCK(::cs_main, return setup->m_node.chainman->m_blockman.LookupBlockIndex(prev))};
    260+        setup->m_node.chainman->GenerateCoinbaseCommitment(*block, pindexPrev);
    261+        FinalizeHeader(header, *setup->m_node.chainman);
    


    marcofleon commented at 11:26 pm on October 1, 2025:
    Shouldn’t *block be passed in here instead of header? Right now, I think the block just ends up with the random nonce from when the header was created. Although this still works half the time with the simplified PoW check. Also, calling GetHash() on the header includes the merkle root, so finalizing the header should happen after we calculate the merkle root below.

    Crypt-iQ commented at 11:24 am on October 2, 2025:
    Ah yes, nice catch. I was wondering why the PoW check was still failing so many times in the coverage.

    Crypt-iQ commented at 4:41 pm on October 2, 2025:
    Fixed in ba89cdd93ee39e884c924ae12737dd06639cee8f; verified that the CheckBlockHeader call in validation.cpp no longer fails.
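
    For reference, the corrected ordering looks roughly like this (names taken from the snippet above; illustrative only):

        // The header hash commits to the merkle root, so compute the root first
        // and only then grind a valid nonce for the full block.
        block->hashMerkleRoot = BlockMerkleRoot(*block);
        FinalizeHeader(*block, *setup->m_node.chainman);
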
  32. Crypt-iQ force-pushed on Oct 2, 2025
  33. Crypt-iQ commented at 4:43 pm on October 2, 2025: contributor

    e2f921458913bcbbe74115cdb2174b0ab31784f2 -> ed813c48f826d083becf93c741b483774c850c86:

    • Fixes an issue pointed out by @marcofleon with the PoW checks failing
    • Adds more descriptive commit messages
  34. in src/test/util/setup_common.cpp:214 in ed813c48f8
    211+
    212+        m_path_root = rand_path();
    213+        TryCreateDirectories(m_path_root);
    214+
    215+        // Copy the cached directory into the newly created temporary directory.
    216+        fs::copy(cached_dir, m_path_root, fs::copy_options::recursive);
    


    stringintech commented at 2:29 pm on October 7, 2025:

    When fuzzing on macOS (Apple Silicon), I ran into lots of crashes when copying the cached dir into the destination:

    0libc++abi: terminating due to uncaught exception of type std::__1::__fs::filesystem::filesystem_error: filesystem error: in copy_file: File exists ["/var/folders/wc/3l6l_0b92zzg54v6rmvqqn300000gn/T/test_common bitcoin/cmpctblock/6c5e38c6dcfbc2607321/regtest/blocks/xor.dat"] ["/var/folders/wc/3l6l_0b92zzg54v6rmvqqn300000gn/T/cmpctblock_cachedca12974312651650d6b0/datadir/regtest/blocks/xor.dat"]
    

    The issue is that the destination dir may already exist before TryCreateDirectories(), and it happens when the AFL++ fork server is enabled (it does not happen when fuzzing with AFL_NO_FORKSRV=1). It appears that forked processes are running into collisions when generating the random path (rand_path()) for creating the destination dir, and when we switch to strong randomness (like how the static dir name is generated), the issue seems to be resolved:

     0diff --git a/src/test/util/setup_common.cpp b/src/test/util/setup_common.cpp
     1index e9768d4c5c..ac477caa62 100644
     2--- a/src/test/util/setup_common.cpp
     3+++ b/src/test/util/setup_common.cpp
     4@@ -164,8 +164,9 @@ BasicTestingSetup::BasicTestingSetup(const ChainType chainType, TestOpts opts)
     5         // tests, such as the fuzz tests to run in several processes at the
     6         // same time, add a random element to the path. Keep it small enough to
     7         // avoid a MAX_PATH violation on Windows.
     8-        const auto rand{HexStr(g_rng_temp_path.randbytes(10))};
     9-        return fs::temp_directory_path() / TEST_DIR_PATH_ELEMENT / test_name / rand;
    10+        std::vector<unsigned char> random_path_suffix(10);
    11+        GetStrongRandBytes(random_path_suffix);
    12+        return fs::temp_directory_path() / TEST_DIR_PATH_ELEMENT / test_name / HexStr(random_path_suffix);
    13     };
    14 
    15     if (m_node.args->IsArgSet("-testdatadir")) {
    

    Crypt-iQ commented at 5:29 pm on October 7, 2025:
    Are you fuzzing with persistent mode?

    stringintech commented at 6:43 pm on October 7, 2025:
    Yes. I am compiling using afl-clang-fast++ (it seems that other options are either unavailable or obsolete on macOS). However, I just added a #undef __AFL_LOOP at the top of fuzz.cpp to disable persistent mode (I made sure I no longer see the [+] Persistent mode binary detected log when running) and reverted the above change to rand_path(), but the same crash still occurs.

    Crypt-iQ commented at 11:27 am on October 8, 2025:
    Think this is because the child processes have the same g_rng_temp_path, so they create the same directory. Will need to see if GetStrongRandBytes introduces non-determinism. I’m confused why fuzzing on Debian didn’t bring this up.

    Crypt-iQ commented at 4:17 pm on October 8, 2025:
    Thanks for pointing this out; GetStrongRandBytes does not introduce non-determinism. I’ll change this in the next push by making rand_path accept a bool strong, so that the original behavior when -testdatadir is set is unchanged.
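
    Something along these lines (a hypothetical shape only; the actual change may differ):

        // Only the fuzz path asks for strong randomness, so the existing
        // -testdatadir behavior stays unchanged.
        const auto rand_path = [&](bool strong) {
            std::vector<unsigned char> rand_bytes(10);
            if (strong) {
                GetStrongRandBytes(rand_bytes);
            } else {
                rand_bytes = g_rng_temp_path.randbytes(10);
            }
            return fs::temp_directory_path() / TEST_DIR_PATH_ELEMENT / test_name / HexStr(rand_bytes);
        };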

    stringintech commented at 3:54 pm on October 9, 2025:

    Since the crash I explained wasn’t observed by others, I doubted whether it was actually coming from multiple child processes running simultaneously or from missed data dir cleanup. So I added a bunch of logs (including the pid and parent pid in each) to investigate this further, and found the following after reading the logs from multiple runs:

    I observed that AFL++ ran one worker process at a time, not multiple workers in parallel. Each worker created a static cached directory once, then ran multiple fuzz iterations sequentially; each iteration creating and destroying a temporary data directory. This worked fine until a worker died unexpectedly (not yet sure why), leaving behind both its last temp data directory and its static cached directory without cleanup.

    Each new worker had the same g_rng_temp_path as the previous worker, so it generated the same temp directory paths. When the next worker started, it faced the copy crash at exactly the same iteration where the previous worker died. For example, if process A died before completing iteration N, it left behind the temp data directory at that iteration. Process B then faced the fs::copy() crash at exactly iteration N, since the generated sequence of temp data directory paths was identical. This worker also died, leaving behind another orphaned static cached directory. This flow would repeat, eventually filling the disk with orphaned directories.


    stringintech commented at 4:06 pm on October 9, 2025:

    I’m confused why fuzzing on debian didn’t bring this up.

    Yeah, I am also confused by this. While my case wasn’t exactly a multi-process execution issue, it seems a true multi-process execution should observe the same crash because of the same g_rng_temp_path across processes.


    Crypt-iQ commented at 4:30 pm on October 9, 2025:
    Some signals are not specifically handled by the fuzzer, so if it creates a datadir (e.g. process_messages) and you Ctrl-C, it will leave it around instead of deleting it in ~BasicTestingSetup. In any case, not using g_rng_temp_path seems like the right choice.
  35. in src/test/fuzz/cmpctblock.cpp:147 in ba89cdd93e outdated
    142+
    143+    LOCK(NetEventsInterface::g_msgproc_mutex);
    144+
    145+    std::vector<CNode*> peers;
    146+    auto& connman = *static_cast<ConnmanTestMsg*>(setup->m_node.connman.get());
    147+    for (int i = 0; i < 3; ++i) {
    


    marcofleon commented at 4:40 pm on October 8, 2025:
    As we discussed a bit offline, increasing the number of peers here to greater than 3 should be enough to hit some of the missing “eviction” logic in MaybeSetPeerAsAnnouncingHeaderAndIDs. Maybe increasing to 8 peers would be fine? Although I’m not sure whether that would test anything more (somewhere else) than 4 would.

    Crypt-iQ commented at 7:00 pm on October 8, 2025:
    Will compare the coverage from increasing to 4 vs increasing to 8.
  36. in src/node/blockstorage.cpp:495 in ed813c48f8
    491@@ -487,7 +492,7 @@ bool BlockManager::WriteBlockIndexDB()
    492     }
    493     std::vector<const CBlockIndex*> vBlocks;
    494     vBlocks.reserve(m_dirty_blockindex.size());
    495-    for (std::set<CBlockIndex*>::iterator it = m_dirty_blockindex.begin(); it != m_dirty_blockindex.end();) {
    496+    for (std::set<CBlockIndex*, CBlockIndexBlockHashComparator>::iterator it = m_dirty_blockindex.begin(); it != m_dirty_blockindex.end();) {
    


    marcofleon commented at 4:53 pm on October 8, 2025:

    I feel like I might be overlooking something here, but could we keep m_dirty_blockindex the same and then if EnableFuzzDeterminism() just sort vBlocks by hash before calling WriteBatchSync()?

    I guess I’m asking: is there a reason why m_dirty_blockindex needs to be sorted by hash from the beginning vs. right before the write?
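
    Concretely, something like this fragment inside WriteBlockIndexDB (a sketch of the suggestion, not actual PR code):

        // Keep m_dirty_blockindex ordered by pointer and only sort the batch
        // right before writing it out, and only when fuzzing deterministically.
        if (EnableFuzzDeterminism()) {
            std::sort(vBlocks.begin(), vBlocks.end(),
                      [](const CBlockIndex* a, const CBlockIndex* b) {
                          return a->GetBlockHash() < b->GetBlockHash();
                      });
        }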


    Crypt-iQ commented at 6:49 pm on October 8, 2025:

    Answered in the review club, also posting here:

    Without this commit, leveldb’s MemTable is non-deterministic since it depends on insert order. If we sort vBlocks right before calling WriteBatchSync, then the leveldb non-determinism is solved, but the number of times we call the comparison function varies, since the initial ordering before sorting varies across runs of the same input. That means we need to sort when we insert into m_dirty_blockindex, as the order of CBlockIndex* we insert there is deterministic (i.e. we’ll always insert A then B, even if it may sort as {A, B} or {B, A}). This is definitely confusing; I used pencil and paper to work this out.

    I did a simple benchmark and it showed ~O(n log n) slowdown. I was not sure how to bench this in a production scenario or what a realistic load here would look like. I guess it’s possible to bench two different IBDs, but there may be other non-IBD scenarios?

  37. marcofleon commented at 4:57 pm on October 8, 2025: contributor
    Some possible discussion about ed813c48f826d083becf93c741b483774c850c86 for review club.
