Userland relative benchmarking on POSIX

The last log The end of benchmarking paints a bleak picture of code development and the impossibility to optimise on modern CPUs. But I still need an objective and reliable way to tell if a given piece of code is better than the other. Because that's all we can do these days, there is no absolute performance metric, it's impossible now.

So a benchmark must now be relative.

But given the previous results from a purely CPU-bound loop, with 16% variance, what accuracy can we expect anyway ? After all Unix/POSIX is not designed for high precision, since real-time features are add-ons, not defining properties. But we need all we can get so kernel-level scheduling and RDTSC are to be used.

There is also the problem that the test affects the state of the computer, which would react adversely. So the testing process should have some "fallow" times to let the rest of the system rest a bit and manage its backlog (like printing stuff to the screen or flushing its file buffers).

So the approach is not "speed run a piece of code and clock it", but

select two or more pieces of code to test, an infinite loop that can be exited by reading a volatile, to have a baseline and compare alternativeS.
run them each in turn for 1 second, separated by one second of sleep, to interleave the runs, prevent the CPU from heating, get relative performance figures while also reducing the effects of spurious interferences
Compare each "slice" individually and then aggregate the results later.

The program should run on only one CPU-core-thread to prevent interference when switching from one core to another, or inter-thread communication or migration...

The switch from one test code to another is best controlled by the POSIX signals, but their jitter is not helpful (let's say 1%) so each second-long run needs RDTSC. The periodic signal provides a coarse granularity and the RDTSC provides the fine reading, compared to the exact work performed (another counter). So the algorithm would be something like:

setup 1Hz timer signal

loop 100 times :
  Test 1 (1second)
  process partial result
  sleep (1second)
  Test 2 (1second)
  process partial result
  sleep (1second)
  Test 3 (1second)
  process partial result
  sleep (1second)

The funny thing is that the periodic signal in POSIX has two effects: not only does it trigger custom code but it also stops ongoing sleep. So there is only one periodic signal to setup. Now, the problem with UNIX is that it creates more frequent interrupts so we don't know if/how often the test procedure has heen context-switched. At least the sleep yields and relieves the pressure for a little while.

I put together this early version to check my assertions:

/* bench.c
created dim. avril 30 01:46:25 CEST 2023
 gcc -Wall -Os -o bench bench.c  && ./bench 
*/

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>

/////////////////////////////////////////////////////////////////

inline uint64_t read_rdtsc() {
  uint64_t eax, edx;
  asm volatile("rdtsc" : "=a" (eax), "=d" (edx));
  return eax | (edx << 32);
}

/////////////////////////////////////////////////////////////////

volatile unsigned int timer_counter=0;
volatile uint64_t last_timestamp=0;

// callback 
void timer_handler(int i) {
  timer_counter++;
  last_timestamp=read_rdtsc();
}

/////////////////////////////////////////////////////////////////

struct itimerval itv_poll = { {1,0}, {1,0} };

int main() {
  uint64_t prev_timestamp=0;
  __sighandler_t e=signal(SIGALRM, timer_handler);
  if ( (e==SIG_ERR)
   || ((e!=SIG_IGN) && (e!=SIG_DFL))) {
    printf("\nsignal(): realtime signal is already busy ??");
    exit (EXIT_FAILURE);
  }

  if (setitimer(ITIMER_REAL, &itv_poll, NULL) == -1) {
    printf("\n can't start timer");
    exit (EXIT_FAILURE);
  }

  prev_timestamp=read_rdtsc();
  for (int i=0; i<20; i++) {
    sleep(2); // gets interrupted by SIGALRM
    printf("%"PRIu64"\n",
      last_timestamp-prev_timestamp);
    prev_timestamp = last_timestamp;
  }

  exit (EXIT_SUCCESS);
}

(uploaded to bench.basic.c)

Results are OK, with a pretty low variance.

The max jitter is +/ 500.000 for a runtime of 2G694, that's 0.2% due to the POSIX jitter, which is also then compensated by the precise TSC.

The end of benchmarking.

Differential benchmarking

Discussions

Become a Hackaday.io Member