Part of #30289, replaces earlier #34138.
This introduces a more accurate cost model for SFL (spanning-forest linearization), to control how much CPU time is spent inside the algorithm for clusters that cannot be linearized perfectly within a reasonable amount of time.
The goal is to have a metric for the amount of work performed, so that txmempool can impose limits on that work: a lower bound that is always performed (unless optimality is reached before that point, of course), and an upper bound to limit the latency and total CPU time spent on this. There are conflicting design goals here:
- On the one hand, it seems ideal if this metric is closely correlated to actual CPU time, because otherwise the limits become inaccurate.
- On the other hand, it seems a nightmare to have the metric be platform/system dependent, as it makes network-wide reasoning nearly impossible. It’s expected that slower systems take longer to do the same thing; this holds for everything, and we don’t need to compensate for this.
There are multiple solutions to this:
- One extreme is just measuring the time. This is very accurate, but extremely platform dependent, and also non-deterministic due to random scheduling/cache effects.
- The other extreme is using a very abstract metric, like counting how many times certain loops/functions inside the algorithm run. That is what is implemented in master right now: just counting the sum of the number of transactions updated across all `UpdateChunks()` calls. However, it necessarily fails to account for significant portions of runtime spent elsewhere, resulting in a rather wide range of “ns per cost” values.
- This PR takes a middle ground, counting many function calls / branches / loops, with weights that were determined through benchmarking, averaged over a number of systems.
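To illustrate the middle-ground approach, the metric boils down to incrementing a counter by a weighted amount at selected points in the algorithm. The following is a minimal sketch with hypothetical names and weights, not the actual interface introduced by this PR:

```cpp
// Hypothetical sketch of a weighted cost accumulator; the class name, method
// names, and weights are illustrative only, not this PR's coefficients.
#include <cstdint>

class CostTracker
{
    uint64_t m_cost{0}; //!< Abstract cost units accumulated so far.
    uint64_t m_limit;   //!< Budget after which the caller should stop.

public:
    explicit CostTracker(uint64_t limit) noexcept : m_limit{limit} {}

    /** Charge a fixed cost for entering some operation, plus a per-element
     *  cost proportional to its input size (e.g. number of transactions). */
    void Charge(uint64_t base_weight, uint64_t per_element_weight, uint64_t elements) noexcept
    {
        m_cost += base_weight + per_element_weight * elements;
    }

    bool Exhausted() const noexcept { return m_cost >= m_limit; }
    uint64_t Total() const noexcept { return m_cost; }
};

// Usage inside a (hypothetical) inner loop of the algorithm:
//   cost.Charge(/*base_weight=*/30, /*per_element_weight=*/5, /*elements=*/chunk_size);
//   if (cost.Exhausted()) break;
```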
Specifically, the cost model was obtained by:
- For a variety of machines:
  - Run a fixed collection of ~385000 clusters, found through random generation and fuzzing that optimized for difficulty of linearization.
  - Linearize each ~3000 times, with different random seeds; sometimes without an input linearization, sometimes with a bad one.
  - Gather cycle counts for each of the operations included in this cost model, broken down by their parameters.
  - Correct the data by subtracting the runtime of obtaining the cycle count itself.
  - Drop the top and bottom 5% of samples from each cycle count dataset, and compute the average of the remaining samples.
  - For each operation, fit a least-squares linear function through the samples (the trimming and fitting steps are sketched below).
- Rescale all machine expressions to make their total time match, as we only care about the relative cost of each operation.
- Take the per-operation average of operation expressions across all machines, to construct expressions for an average machine.
- Approximate the result with integer coefficients.
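For illustration, the per-machine data processing above boils down to a 5%-trimmed mean per parameter value, followed by an ordinary least-squares fit of cycles as a linear function of the operation's size parameter. A minimal standalone sketch of those two steps (not code from this PR, and assuming a single scalar parameter per operation):

```cpp
// Standalone sketch of the offline analysis steps; not part of the PR itself.
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

/** Drop the top and bottom 5% of samples and average the rest. */
double TrimmedMean(std::vector<double> samples)
{
    std::sort(samples.begin(), samples.end());
    const size_t drop = samples.size() / 20; // 5% on each side
    double sum = 0.0;
    size_t count = 0;
    for (size_t i = drop; i + drop < samples.size(); ++i) {
        sum += samples[i];
        ++count;
    }
    return count ? sum / count : 0.0;
}

/** Least-squares fit of cycles = a + b * size through (size, mean_cycles) points,
 *  returning {a, b}. */
std::pair<double, double> LinearFit(const std::vector<std::pair<double, double>>& points)
{
    if (points.empty()) return {0.0, 0.0};
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (const auto& [x, y] : points) {
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    const double n = points.size();
    const double denom = n * sxx - sx * sx;
    const double b = denom != 0.0 ? (n * sxy - sx * sy) / denom : 0.0;
    const double a = (sy - b * sx) / n;
    return {a, b};
}
```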
The benchmarks were performed by l0rinc <pap.lorinc@gmail.com> and myself, on AMD Ryzen 5950X, AMD Ryzen 7995WX, AMD Ryzen 9980X, Apple M4 Max, Intel Core i5-12500H, Intel Core Ultra 7 155H, Intel N150 (Umbrel), Intel Core i7-7700, Intel Core i9-9900K, Intel Haswell (VPS, virtualized), ARM Cortex-A76 (Raspberry Pi 5).
Based on final benchmarking, the “acceptable” iteration count (which is the minimum spent on every cluster) is set to 64000 units, which corresponds to roughly 50 μs on a Ryzen 5950X and similar modern desktop hardware.
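To illustrate how such a budget might be consumed by a caller (the struct, its placeholder improvement step, and the numbers below are hypothetical, not the actual txmempool integration): the acceptable budget is always granted, the caller may grant more, and work stops early once optimality is proven.

```cpp
// Hypothetical sketch of budget-limited linearization; names and the simulated
// improvement step are illustrative only.
#include <algorithm>
#include <cstdint>

static constexpr uint64_t ACCEPTABLE_COST = 64000; //!< Minimum budget per cluster (~50 us on a Ryzen 5950X).

struct Work {
    uint64_t cost{0};    //!< Abstract cost units consumed so far.
    bool optimal{false}; //!< Whether an optimal linearization has been proven.

    //! Placeholder for one improvement step of the algorithm: it would charge
    //! its weighted cost and may eventually prove optimality (simulated here).
    void Step() { cost += 100; if (cost >= 5000) optimal = true; }
};

//! Run a cluster with a caller-provided extra budget; the acceptable budget is
//! always granted, and work stops as soon as optimality is reached.
uint64_t RunCluster(uint64_t extra_budget)
{
    const uint64_t budget = std::max(ACCEPTABLE_COST, extra_budget);
    Work work;
    while (!work.optimal && work.cost < budget) work.Step();
    return work.cost;
}
```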