Goal
This PR refactors the benchmarking functions as per #1701, in order to make benchmarks more deterministic and less influenced by the environment. This is achieved by replacing the wall-clock timer with a per-process CPU timer when possible.
Just some quick comments:
> [x] remove the number of runs (`count`) in favor of a simpler, cleaner approach with just the number of iterations (`iter`).
I think there’s a reason to have this. Some benchmarks take much longer than others, so it probably makes sense to run fewer iters for these.
> [x] remove `min` and `max` statistics in favor of a simpler approach with just `avg`.
I think `min` and `max` are useful. For constant-time code, you can also compare `min`. And `max` gives you an idea of whether there were fluctuations or not.
> [x] remove the needless fixed-point conversion in favor of simpler floating-point divisions.
Well, okay, that has a history; see #689. It's debatable if it makes sense to avoid floating-point math, but as long as it doesn't get in your way here, it's a cool thing to keep. :D
`min` and `max` just complicate things. Let me explain: first of all, as it is right now, they don't even measure the min and max; they just measure the min/max of the *averages* of all runs, i.e. not the absolute values. Furthermore, in order to have them, one would need to run all the iterations 10 times more. Benchmarks are already slow; adding this min/max slows them down 10-fold. IMHO it's completely unnecessary. @real-or-random
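For illustration, here is a minimal sketch of the scheme being described (hypothetical code, not the actual bench.c; `bench_sketch` and `now_us` are made-up names): min/max come from the per-run averages, and reporting them costs `count * iter` total iterations.

```c
#include <float.h>
#include <stdint.h>

/* Hypothetical sketch, not the actual bench code: `count` runs of `iter`
 * iterations each. Note that min/max are the min/max of the per-run
 * *averages*, and the total work is count * iter iterations. */
static void bench_sketch(void (*fn)(void), int count, int iter,
                         int64_t (*now_us)(void),
                         double *min_us, double *max_us, double *avg_us) {
    int run, i;
    double total = 0.0;
    *min_us = DBL_MAX;
    *max_us = 0.0;
    for (run = 0; run < count; run++) {
        int64_t begin = now_us();
        double run_avg;
        for (i = 0; i < iter; i++) {
            fn();
        }
        run_avg = (double)(now_us() - begin) / iter;
        if (run_avg < *min_us) *min_us = run_avg;
        if (run_avg > *max_us) *max_us = run_avg;
        total += run_avg;
    }
    *avg_us = total / count;
}
```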
If we're going to rework this, I'd suggest using the stabilized quartiles approach from https://cr.yp.to/papers/rsrst-20250727.pdf.
> I think there's a reason to have this. Some benchmarks take much longer than others, so it probably makes sense to run fewer iters for these.
Right now all benchmarks are run with `count=10` and fixed iters (apart from `ecmult_multi`, which adjusts the number of iters, not the count). Therefore `count` is only useful to extrapolate `min` and `max`.
> Well, okay, that has a history; see #689. It's debatable if it makes sense to avoid floating-point math, but as long as it doesn't get in your way here, it's a cool thing to keep. :D
I disagree with #689. It overcomplicates things for the sake of not having floating-point math. Those divisions aren't even in the hot path; they're outside the benchmarks.
Concept NACK on removing any ability to observe variance in timing. The current min/avg/max are far from perfect, but they work fairly well in practice. Improving them is welcome, but removing them is a step backwards.
What is the usefulness of measuring min/max when we are taking OS interference & thermal throttling out of the equation? min/max will be extremely close to the avg no matter how bad the benchmarked function is.
`gettimeofday()` has been officially discouraged since 2008 in favor of `clock_gettime()`. The POSIX standard marks it as obsolescent but still provides the API for backward compatibility.
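For reference, a minimal sketch of the non-obsolescent equivalent (still a wall clock either way; `wall_time_us` is a made-up name):

```c
#include <stdint.h>
#include <time.h>

/* Wall-clock microseconds via clock_gettime(); same behavior as the old
 * gettimeofday() code path, just through the non-obsolescent API. */
static int64_t wall_time_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}
```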
Even though the manual says that `CLOCK_PROCESS_CPUTIME_ID` is only useful if the process is locked to a core, modern CPUs have largely addressed this issue. So I think it is fair to compile with `CLOCK_PROCESS_CPUTIME_ID` even though we don't have the guarantee that the user has pinned the benchmarking process to a core. The worst case scenario is an unreliable benchmark, which is what the current repo has anyway.
I added a line to README.md with best practices for running the benchmarks.
I also tried adding a function to pin the process to a core directly in C, but there's no standard, POSIX-compliant way to do so. There is `pthread_setaffinity_np()` on Linux, where "np" stands for "non-portable".
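For reference, a Linux-only sketch of what pinning in C could look like (this uses `sched_setaffinity()`, which, like `pthread_setaffinity_np()`, is a non-portable extension, hence it was left out of the PR):

```c
/* Linux-only sketch; sched_setaffinity() is a non-POSIX extension,
 * so this cannot go into portable code. */
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling process (pid 0) to the given core; returns 0 on success. */
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```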
> Concept NACK on removing any ability to observe variance in timing. The current min/avg/max are far from perfect, but they work fairly well in practice. Improving them is welcome, but removing them is a step backwards.
>
> What is the usefulness of measuring min/max when we are taking OS interference & thermal throttling out of the equation? min/max will be extremely close to the avg no matter how bad the benchmarked function is.
The point is exactly having a simple way of verifying that there’s indeed no interference. Getting rid of sources of variance is hard to get right, and it’s impossible to get a perfect solution. (This discussion shows this!) So we better have a way of spotting if something is off.
I like the stabilized quartiles idea.
> I like the stabilized quartiles idea.
tbh it scares me a bit, will see what I can do. Maybe in a future PR.
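For whoever picks this up later, a rough sketch of quartile-based reporting (one reading of the idea, not necessarily the paper's exact procedure): collect one timing per batch, sort, and report Q1/median/Q3 instead of min/avg/max.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_int64(const void *a, const void *b) {
    int64_t x = *(const int64_t *)a;
    int64_t y = *(const int64_t *)b;
    return (x > y) - (x < y);
}

/* Sort the per-batch timings in place and print the quartiles.
 * Unlike min/max, quartiles are robust against the occasional
 * interrupted batch (interrupts, migrations). */
static void print_quartiles(int64_t *times, size_t n) {
    qsort(times, n, sizeof(*times), cmp_int64);
    printf("q1=%ld median=%ld q3=%ld\n",
           (long)times[n / 4], (long)times[n / 2], (long)times[(3 * n) / 4]);
}
```

The spread between Q1 and Q3 would still give the variance-spotting property discussed above, without being dominated by outliers.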
```diff
@@ -30,6 +30,8 @@ set(${PROJECT_NAME}_LIB_VERSION_AGE 0)
 #=============================
 set(CMAKE_C_STANDARD 90)
 set(CMAKE_C_EXTENSIONS OFF)
+# Enable POSIX features while maintaining ISO C compliance
+add_compile_definitions(_POSIX_C_SOURCE=200112L) # needed for `clock_gettime()` in bench.c
```
Only needed for `bench.c`? Then this could go in `src/CMakeLists.txt`, targeting only the benchmarks.
```diff
@@ -159,6 +159,8 @@ Benchmark
 ------------
 If configured with `--enable-benchmark` (which is the default), binaries for benchmarking the libsecp256k1 functions will be present in the root directory after the build.
 
+For an accurate benchmark, it is strongly recommended to pin the process to a dedicated CPU core and to disable CPU frequency scaling.
```
```diff
@@ -7,30 +7,63 @@
 #ifndef SECP256K1_BENCH_H
 #define SECP256K1_BENCH_H
 
+#if defined(_WIN32)
+# include <windows.h>
+#else
+# include <time.h>
+# include <sys/time.h>
```
One include is for `gettimeofday()` and one for `clock_gettime()`; they don't conflict.
Just tried. Unfortunately, the macros to detect whether `gettimeofday()` or `clock_gettime()` is needed are defined in the `time.h` header itself, so the only viable solution is:
```c
#include <time.h>
#if !defined(CLOCK_PROCESS_CPUTIME_ID) && !defined(CLOCK_MONOTONIC) && !defined(CLOCK_REALTIME)
# include <sys/time.h>
#endif
```
which every LLM tells me is "inherently unreliable" and can lead to portability issues: "Including sys/time.h on modern POSIX systems that already have time.h and clock_gettime doesn't hurt, and ensures gettimeofday is available if needed." "Trying to skip it adds complexity with almost no benefit."
Why not check for `clock_gettime` the POSIX way? https://pubs.opengroup.org/onlinepubs/9799919799/functions/clock_getres.html says that the proper macro to check is `_POSIX_CPUTIME`, and that `CLOCK_MONOTONIC` must be supported, so there's no need to check for it.
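For context, the compile-time check POSIX recommends would look something like this (sketch; `BENCH_HAVE_CPUTIME` is a made-up name):

```c
#include <unistd.h> /* defines the _POSIX_* option macros */
#include <time.h>

/* POSIX option check: a value greater than 0 means the per-process
 * CPU-time clock is supported at compile time; a value of 0 means
 * "ask at runtime via sysconf(_SC_CPUTIME)". */
#if defined(_POSIX_CPUTIME) && _POSIX_CPUTIME > 0
# define BENCH_HAVE_CPUTIME 1
#endif
```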
With `CLOCK_MONOTONIC` support being mandatory, this makes it way simpler: no need to include `<sys/time.h>` anymore. `CLOCK_REALTIME` stays as the last fallback.
The `_POSIX_CPUTIME` and `_POSIX_MONOTONIC_CLOCK` macros are to be used when you want to ensure that the library has mandatory support for those features. In my case, for example, gcc doesn't expose `_POSIX_MONOTONIC_CLOCK` but still supports the monotonic clock. Therefore it's better to check `CLOCK_PROCESS_CPUTIME_ID` and `CLOCK_MONOTONIC` directly.
```diff
+
+    return (int64_t)((k.QuadPart + u.QuadPart) / 10);
+#else /* POSIX */
+
+# if defined(CLOCK_PROCESS_CPUTIME_ID)
+    /* in theory, CLOCK_PROCESS_CPUTIME_ID is only useful if the process is locked to a core. see https://linux.die.net/man/3/clock_gettime */
```

[…]

```diff
+    /* WARN: timer is influenced by environment (OS, scheduling, interrupts...) */
     struct timeval tv;
-    gettimeofday(&tv, NULL);
-    return (int64_t)tv.tv_usec + (int64_t)tv.tv_sec * 1000000LL;
+    gettimeofday((struct timeval*)&tv, NULL);
+    return (int64_t)tv.tv_sec * 1000000 + tv.tv_usec;
```
How is this diff an improvement?
`LL` is not needed.

```diff
+    struct timespec ts;
+    if (clock_gettime(CLOCK_REALTIME, &ts) != 0) {
+        return 0;
+    }
+    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
+# else
```
Yes, I tried everything you mentioned. Removing the redundancy here can be done by just using the ifdefs for the clock constant, but you'd need a different path for the last fallback: the last case uses `tv`, not `ts`.
I tried issuing warnings with `#warning` to tell the user about the precision of the clock, but it's not compatible with C90.
Here’s the DRY’d version:
```c
# if defined(CLOCK_PROCESS_CPUTIME_ID)
    /* in theory, CLOCK_PROCESS_CPUTIME_ID is only useful if the process is locked to a core. see https://linux.die.net/man/3/clock_gettime */
    const clockid_t clk_id = CLOCK_PROCESS_CPUTIME_ID;
# elif defined(CLOCK_MONOTONIC)
    /* WARN: timer is influenced by environment (OS, scheduling, interrupts...) */
    const clockid_t clk_id = CLOCK_MONOTONIC;
# elif defined(CLOCK_REALTIME)
    /* WARN: timer is influenced by environment (OS, scheduling, interrupts...) */
    const clockid_t clk_id = CLOCK_REALTIME;
# else
#  define USE_GETTIMEOFDAY
# endif

# ifndef USE_GETTIMEOFDAY
    struct timespec ts;
    if (clock_gettime(clk_id, &ts) != 0) {
        return 0;
    }
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
# else
    /* WARN: timer is influenced by environment (OS, scheduling, interrupts...) */
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_usec + (int64_t)tv.tv_sec * 1000000;
# endif
```
Less readable, IMO.
> I tried issuing warnings with `#warning` to tell the user about the precision of the clock, but it's not compatible with C90.
What I had in mind is printing the warning out at runtime.
```diff
-    struct timespec tv;
-    if (!timespec_get(&tv, TIME_UTC)) {
-        fputs("timespec_get failed!", stderr);
-        exit(EXIT_FAILURE);
+    if (GetProcessTimes(GetCurrentProcess(), &creation, &exit, &kernel, &user) == 0) {
+        return 0;
```
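Piecing the two Windows fragments together, the full conversion presumably looks something like this (sketch; `cpu_time_us` is a made-up name, and `GetProcessTimes()` reports kernel and user time as `FILETIME` values in 100 ns units):

```c
#include <windows.h>
#include <stdint.h>

/* Per-process CPU time (kernel + user) in microseconds. */
static int64_t cpu_time_us(void) {
    FILETIME creation, exit, kernel, user;
    ULARGE_INTEGER k, u;
    if (GetProcessTimes(GetCurrentProcess(), &creation, &exit, &kernel, &user) == 0) {
        return 0;
    }
    k.LowPart = kernel.dwLowDateTime;
    k.HighPart = kernel.dwHighDateTime;
    u.LowPart = user.dwLowDateTime;
    u.HighPart = user.dwHighDateTime;
    /* FILETIME counts 100 ns intervals; divide by 10 for microseconds. */
    return (int64_t)((k.QuadPart + u.QuadPart) / 10);
}
```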
> The point is exactly having a simple way of verifying that there's indeed no interference. Getting rid of sources of variance is hard to get right, and it's impossible to get a perfect solution. (This discussion shows this!) So we better have a way of spotting if something is off.
Fine, but at least let's make it optional. I don't like my benchmarks being 10 times slower just because min/max need to be computed. If I say 20'000 iterations, I want it to be 20'000 iterations.
> > The point is exactly having a simple way of verifying that there's indeed no interference. Getting rid of sources of variance is hard to get right, and it's impossible to get a perfect solution. (This discussion shows this!) So we better have a way of spotting if something is off.
>
> Fine, but at least let's make it optional. I don't like my benchmarks being 10 times slower just because min/max need to be computed. If I say 20'000 iterations, I want it to be 20'000 iterations.
Not sure. I suggest making this PR focused on a single (uncontroversial) change, which is switching to per-process clocks.
```diff
+
+# if defined(CLOCK_PROCESS_CPUTIME_ID)
+    /* In theory, CLOCK_PROCESS_CPUTIME_ID is only useful if the process is locked to a core.
+     * See https://linux.die.net/man/3/clock_gettime.
+     * In practice, modern CPUs have largely addressed this issue. The worst case scenario is an unreliable benchmark.
+     * See https://chatgpt.com/s/dr_68b7486a05a481919d9f121d182cc0cf
```
```diff
@@ -176,6 +204,14 @@ static int get_iters(int default_iters) {
     }
 }
 
+static void print_clock_info(void) {
+#if defined(_WIN32) || defined(CLOCK_PROCESS_CPUTIME_ID)
+    printf("INFO: using CPU timer, results are not influenced by other running processes\n\n");
```
…`CLOCK_PROCESS_CPUTIME_ID`. I changed the message to notify the user that they should probably pin the process.

> I changed the message to notify the user that they should probably pin the process.
If that’s not required on modern CPUs, why advise to do this?
We could just say “INFO: Using per-process CPU timer”.
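i.e. something like this (sketch; the `WARN` wording for the fallback branch is just a suggestion):

```c
#include <stdio.h>
#include <time.h>

static void print_clock_info(void) {
#if defined(_WIN32) || defined(CLOCK_PROCESS_CPUTIME_ID)
    printf("INFO: Using per-process CPU timer\n\n");
#else
    /* Hypothetical wording for the wall-clock fallback branch. */
    printf("WARN: using wall-clock timer, results may be influenced by other running processes\n\n");
#endif
}
```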
```diff
+
+#else /* POSIX */
+
+# if defined(CLOCK_PROCESS_CPUTIME_ID)
+    /* In theory, CLOCK_PROCESS_CPUTIME_ID is only useful if the process is locked to a core. see https://man7.org/linux/man-pages/man2/clock_gettime.2.html *
+     * in practice, modern CPUs have synchronized TSCs which addresses this issue. see https://docs.amd.com/r/en-US/ug1586-onload-user/Timer-TSC-Stability */
```
```c
    /* In theory, CLOCK_PROCESS_CPUTIME_ID is only useful if the process is locked to a core,
     * see `man clock_gettime` on Linux. In practice, modern CPUs have synchronized TSCs which
     * address this issue, see https://docs.amd.com/r/en-US/ug1586-onload-user/Timer-TSC-Stability . */
```
Not familiar with nanobench, but I can see it uses `std::chrono`, which uses `CLOCK_MONOTONIC` (our fallback) under the hood, not the TSC. Also, the user would need to have a C++ compiler; not a big deal, but I think this repo needs to strictly adhere to C89.
This commit improves the reliability of benchmarks by removing some of the influence of other processes running in the background. This is achieved by using CPU-bound clocks that aren't influenced by interrupts, sleeps, blocked I/O, etc.
…`QueryPerformanceCounter()`, which has microsecond precision. I also removed the `if` checks around `clock_gettime()`, as it can only fail due to programming errors.
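For reference, a sketch of how `QueryPerformanceCounter()` ticks could be converted to microseconds (hypothetical helper, not necessarily the PR's exact code):

```c
#include <windows.h>
#include <stdint.h>

/* High-resolution wall-clock microseconds on Windows. */
static int64_t perf_time_us(void) {
    LARGE_INTEGER freq, count;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&count);
    /* Split into whole seconds and remainder to avoid 64-bit overflow. */
    return (int64_t)(count.QuadPart / freq.QuadPart) * 1000000
         + (int64_t)(count.QuadPart % freq.QuadPart) * 1000000 / freq.QuadPart;
}
```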
I haven’t investigated this in detail but couldn’t find any related issue: were there any efforts to use nanobench.ankerl.com like we do in Core instead?
nanobench is a framework for benchmarking C++, but this is C. Of course, we could still call our C from C++, but adding C++ to the code base seems a bit overkill to me if it's "just" for the purpose of benchmarks. I agree that using an existing framework may be a good idea, but I'm not aware of any in pure C.
> but adding C++ to the code base seems a bit overkill to me if it's "just" for the purpose of benchmarks

The purpose would be to standardize the benchmarking and avoid reinventing solutions to problems that aren't strictly related to the purpose of this library. It would also help with trusting the results more, since we already know how that benchmarking library behaves. The main client of this library is C++ code, so it doesn't seem far-fetched to me to test that.
> > but adding C++ to the code base seems a bit overkill to me if it's "just" for the purpose of benchmarks
>
> The purpose would be to standardize the benchmarking and avoid reinventing solutions to problems that aren't strictly related to the purpose of this library. It would also help with trusting the results more, since we already know how that benchmarking library behaves. The main client of this library is C++ code, so it doesn't seem far-fetched to me to test that.
Sorry, I didn’t want to be discouraging. I don’t think at all that nanobench is a far-fetched idea. And I truly agree that reinventing the wheel is not great.
In the end, what counts is the maintenance burden. My main concern with C++ is that I’m not sure how fluent our regular contributors are in C++. (I’m not, but luckily I’m not the only one here.) If the bit of C++ is harder to touch (for us) than this code, then nanobench is not a good idea. If, on the other hand, we have people who can maintain C++, or if we get support from the Core contributors who are familiar with nanobench, then this can be a win.
I’d be happy to see a demo/WIP PR but if you feel that this is a lot of work, then it might be a good idea to see what other contributors and maintainers think.