Builds on top of #34023, part of #30289.
This introduces a more accurate cost model for SFL (spanning-forest linearization), to control how much CPU time is spent inside the algorithm on clusters that cannot be linearized perfectly within a reasonable amount of time.
The goal is to have a metric for the amount of work performed, so that txmempool can impose limits on that work: a lower bound of work that is always performed (unless optimality is reached before that point, of course), and an upper bound to limit the latency and total CPU time spent on this (see the sketch after the list below). There are conflicting design goals here:
- On the one hand, it seems ideal if this metric is closely correlated with actual CPU time, because otherwise the limits become inaccurate.
- On the other hand, it seems a nightmare to have the metric be platform/system dependent, as it makes network-wide reasoning nearly impossible. It’s expected that slower systems take longer to do the same thing; this holds for everything, and we don’t need to compensate for this.
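
To make the intended use concrete, here is a minimal sketch of what such limits could look like from txmempool's side. All identifiers (`Cost`, `WorkLimits`, `IsOptimal()`, `CostSpent()`, `Step()`, `RunLinearization()`) are hypothetical and only illustrate the lower-bound / upper-bound idea; they are not the actual interface.

```cpp
// Minimal sketch of limiting linearization by an abstract cost metric rather
// than by wall-clock time. All identifiers below are hypothetical.
#include <cstdint>

/** Abstract amount of work, in implementation-defined cost units. */
using Cost = uint64_t;

struct WorkLimits {
    Cost min_cost; //!< Work that is always performed, unless optimality is reached earlier.
    Cost max_cost; //!< Hard cap, bounding latency and total CPU time.
};

/** Drive a linearizer: spend at least min_cost, and never more than max_cost. */
template <typename Linearizer>
void RunLinearization(Linearizer& lin, const WorkLimits& limits, bool keep_improving)
{
    // Always spend the guaranteed lower bound, unless an optimal result is found first.
    while (!lin.IsOptimal() && lin.CostSpent() < limits.min_cost) lin.Step();
    // Optionally continue improving, but never past the upper bound.
    while (keep_improving && !lin.IsOptimal() && lin.CostSpent() < limits.max_cost) lin.Step();
}
```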
There are multiple solutions to this:
- One extreme is just measuring the time. This is very accurate, but extremely platform dependent, and also non-deterministic due to random scheduling/cache effects.
- The other extreme is using a very abstract metric, like counting how many times certain loops/functions inside the algorithm run. That is what is implemented as of #32545 and #34023: just counting the sum of the numbers of transactions updated across all `UpdateChunks()` calls. However, it necessarily fails to account for significant portions of runtime spent elsewhere, resulting in a rather wide range of “ns per cost” values.
- This PR takes a middle ground, counting many function calls / branches / loops, with weights that were determined through benchmarking on a reference system (Zen 4 based); a sketch of this approach follows below.
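
As an illustration of the middle-ground approach, the sketch below shows how weighted counting could look. The identifiers and weight values are made up; the actual weights in this PR are benchmark-derived on the Zen 4 reference system and are attached to specific calls, branches, and loops inside the algorithm.

```cpp
// Illustrative sketch of weighted cost counting. Names and numbers are
// placeholders, not the actual instrumentation or weights from this PR.
#include <cstdint>

class CostAccumulator {
    uint64_t m_cost{0};
public:
    void Charge(uint64_t weight) noexcept { m_cost += weight; }
    uint64_t Total() const noexcept { return m_cost; }
};

// Placeholder per-event weights; in this PR they are determined by
// benchmarking each instrumented event on a reference system.
static constexpr uint64_t COST_CHUNK_TX_UPDATE = 7; //!< Per transaction updated in UpdateChunks().
static constexpr uint64_t COST_DEP_SCAN = 3;        //!< Per dependency examined.

// The earlier metric only counted transactions updated across UpdateChunks() calls;
// with weighted counting, every instrumented event contributes its own cost.
void AccountForChunkUpdate(CostAccumulator& acc, unsigned num_tx_updated, unsigned num_deps_scanned)
{
    acc.Charge(COST_CHUNK_TX_UPDATE * static_cast<uint64_t>(num_tx_updated));
    acc.Charge(COST_DEP_SCAN * static_cast<uint64_t>(num_deps_scanned));
}
```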
This is draft because:
- The weights used are outdated, as several non-trivial improvements were made to #32545 and #34023 since.
- I don’t know if it is the right middle ground, and it seems good to have some conceptual discussion first:
  - Using benchmarks on a reference system inherently captures some platform specifics, making it possible that the metric is not very accurate on other systems.
  - A benchmark-based metric is nontrivial to maintain.