The current comparison tool tests network behavior that is too specific. In particular, it seems hard to adapt to headers-first, which in some cases does not fetch blocks in the same order (e.g. because it already knows from the header that a block is invalid), even though at every point it should end up with the same best block. Correct me if I’m wrong, @TheBlueMatt.
One of the reasons is that the current comparison tool incorporates both the scenario to run and the logic to compare against (the BitcoinJ full node code). This makes it hard to adapt.
I propose writing a tool that does nothing but implement a testing scenario and run it against two bitcoinds in parallel: an old trusted version and the new version to be tested. After each step (and after synchronizing through a ping/pong), it compares whether the two agree on the best chain.
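To make the idea concrete, here is a rough sketch of what the comparison loop might look like. Everything in it is hypothetical: `run_scenario`, the `send`/`sync`/`best_block` methods and `FakeNode` are made-up names for illustration, not an existing API. A real version would wrap a P2P connection to each bitcoind, with `sync` sending a ping and waiting for the matching pong.

```python
from typing import Iterable, List, Tuple


def run_scenario(steps: Iterable[str], reference, candidate) -> List[Tuple[int, str, str]]:
    """Feed every scenario step to both nodes and record where their best blocks differ.

    `reference` and `candidate` are assumed to expose three hypothetical methods:
    send(step), sync() and best_block().
    """
    mismatches = []
    for index, step in enumerate(steps):
        for node in (reference, candidate):
            node.send(step)
        for node in (reference, candidate):
            node.sync()  # in a real tool: ping, then wait for the pong
        ref_tip, cand_tip = reference.best_block(), candidate.best_block()
        if ref_tip != cand_tip:
            mismatches.append((index, ref_tip, cand_tip))
    return mismatches


class FakeNode:
    """In-memory stand-in for a bitcoind, only here to exercise the loop."""

    def __init__(self, rejects=()):
        self.rejects = set(rejects)
        self.tip = "genesis"

    def send(self, block_hash: str) -> None:
        # Pretend every accepted block becomes the new best block.
        if block_hash not in self.rejects:
            self.tip = block_hash

    def sync(self) -> None:
        pass  # nothing is asynchronous in this fake

    def best_block(self) -> str:
        return self.tip
```

The loop only ever looks at the best block of each node, never at the order in which blocks were requested, so it should not care whether the node under test is headers-first or not.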
Such a tool could be as simple as a Python node with an associated block database and block header tree that can answer getdata/getheaders/getblocks and runs through a scripted scenario, but doesn’t do any validation itself.
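As an illustration of how little the scripted node would need, here is a minimal sketch of a block database with a header tree. The `Block` and `BlockStore` names, the string "hashes" and the method signatures are all assumptions made for this example; there is no serialization or networking, and, as proposed, no validation at all.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Block:
    hash: str                # stand-in for a real block hash
    prev: Optional[str]      # parent hash, None for genesis
    payload: bytes = b""     # opaque block contents, never validated


class BlockStore:
    """Hypothetical block database; the header tree is implied by the prev links."""

    def __init__(self) -> None:
        self.blocks: Dict[str, Block] = {}

    def add(self, block: Block) -> None:
        self.blocks[block.hash] = block

    def chain_to(self, tip: str) -> List[str]:
        """Return the hashes from genesis up to and including `tip`."""
        chain = []
        cur: Optional[str] = tip
        while cur is not None:
            chain.append(cur)
            cur = self.blocks[cur].prev
        chain.reverse()
        return chain

    def getheaders(self, locator: List[str], tip: str, limit: int = 2000) -> List[str]:
        """Answer a getheaders-style request for the branch ending in `tip`.

        Starts after the first locator entry found on that branch. If none is
        found, this sketch simply starts at the beginning of the chain.
        """
        chain = self.chain_to(tip)
        position = {h: i for i, h in enumerate(chain)}
        start = 0
        for h in locator:
            if h in position:
                start = position[h] + 1
                break
        return chain[start:start + limit]

    def getdata(self, block_hash: str) -> Optional[Block]:
        """Return the stored block, or None if the scenario never defined it."""
        return self.blocks.get(block_hash)
```

Because the store keeps every branch the scenario defines, the script can announce whichever tip it wants at each step, including ones on invalid forks, and leave it entirely to the two bitcoinds to decide what the best chain is.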