8dcc|1 year ago
Yes, you are right. Do you have any suggestions on how I should do the benchmarking? I am not sure how this is usually done, but if the output is some kind of graph, I would also like to know how these graphs are usually produced.
kragen|1 year ago
Benchmarking is a full-fledged scientific experiment. You have a hypothesis that a particular change that you can make, such as using a different allocator, will have a particular effect, such as making your program run in less time. So you make just that one single change and measure the results, while carefully controlling all other conditions that could also affect your program's run time. You have to do the test more than once in order to find out whether you've succeeded in controlling them. If you do a good enough job controlling the conditions of the experiment, it becomes reproducible and therefore falsifiable. Programming usually isn't science, but benchmarking is. A great thing about it is that each run of the experiment can take a fraction of a second, so you can do the whole scientific method from beginning to end in a few minutes and get a reproducible, falsifiable result.
Benchmarking can get arbitrarily sophisticated, but the most important thing is to dodge the bullets that can turn benchmarks into nonsense, rather than produce very sophisticated graphs or get very high precision.
The simplest thing on Linux or macOS is to compile a command-line executable you can run in a terminal, and run `time ./yourprogram` at least three times to get an idea of how long the wallclock time is and how variable it is. (The user and system times shown are not very reliable anymore.) Presumably there is also some way to do the same thing on Microsoft Windows. If you're on a laptop, check to see if it makes a difference whether you're plugged in. Also, check whether the result from `time` is consistent; sometimes lags in CPU speed scaling can produce unpredictable results. Linux is not very consistent, so you should expect variations in the range of 1–10% from run to run when your program takes on the order of a second to run.
If you get consistently shorter wallclock times after recompiling the program with a different allocator and the same compiler options (probably make sure optimization isn't completely turned off), you can probably attribute that difference to the different allocator. It's a good idea to switch back to the original allocator and verify that you can reproduce your original results to make sure you didn't accidentally change something else at the same time (maybe your laptop went into low-power mode because the battery is getting low, or maybe you changed the number of iterations in a loop and forgot you changed it, or maybe Firefox loaded a YouTube video in the background and is making the machine swap, that kind of thing).
It's useful to ensure that you aren't swapping and that your CPU load other than the benchmark is minimal. `top` (or, better, `htop`) and `vmstat 5` (or, better, `dstat 5`) can be helpful here. Unless your program is heavily multithreaded, the CPU load is not as crucial as it used to be now that we all have multicore processors.
It's helpful to check the particular version of the program you're benchmarking into Git, including the Makefile or whatever specifies the compiler options. That way when you write down your results you can refer to the specific Git commit hash. Also, record your compiler version, your operating system version, and your CPU. Ideally someone who isn't you should be able to read your notes on the benchmark, run it again, and get the same results, if they have access to the same hardware.
The actual benchmark program itself can be as simple as allocating and deallocating in a loop, ideally inside another loop to run the whole benchmark multiple times; but on processors with cache (which includes every PC since the early 90s) the working set size can be very important, and you may get significantly different results for allocating 20 kilobytes 64 bytes at a time and 20 gigabytes 64 bytes at a time. If you pass in the number of iterations for one of the loops on the command line, instead of hardcoding it in the program code, it's easy to start with a small iteration count and work your way up a factor of 10 at a time until the program takes a few seconds to run. In this case, also be sure to record the specific command line you were measuring the performance of in your notes; knowing that a program took 2.205–2.216 seconds to run is of no value if you don't know if it was running a thousand iterations or a million.
It's a good idea to actually compute something that you output using the memory you're allocating and deallocating, just in case GCC's optimizer decides that your benchmark loop has no effect and therefore can be safely removed from your program. (In http://canonical.org/~kragen/sw/dev3/kmregion_example.c I computed the sum of the data value stored in each node of the 24th of all the 5000-node linked lists I allocated and deallocated. That's because, when I read the disassembly of my test code with `objdump -d kmregion_example`, I found out that GCC had removed most of the code inside the test loop because it could prove that nobody ever read the struct fields I was initializing. Reading disassembly is neither necessary nor sufficient for dodging all bullets that invalidate benchmarks.)
Varying the number of iterations in both the innermost and outermost loop should produce roughly proportional increases and decreases in wallclock time. (A helpful technique is to subtract the time for 1000 iterations from the time for 1001000 iterations to get the time for the last million iterations.) If it doesn't, you're not measuring what you think you're measuring. For example, especially with JIT compilers, you may be measuring startup overhead. That bit me this week with http://canonical.org/~kragen/sw/dev3/hadamardsin.js, which I erroneously reported was slower to run than the LuaJIT version. It turns out that it's slightly faster to run, just much slower to compile.
Running the benchmark on more than one computer is helpful because sometimes changes that speed things up on one system slow them down on another. I remember a recursive backtracking search program for logic circuit optimization I wrote in C long ago which used longjmp() to terminate the search when it found a solution; it ran reasonably fast on Linux but very slowly on my friend's Macintosh because MacOS implemented setjmp() as sigsetjmp(), transforming it from a very cheap operation into a very expensive one.
That's all you need to know to reliably do optimizations. In fact, that's probably more than you need to know. Everything below this point is extra.
— ⁂ —
Allocators are especially tricky to benchmark because often, by changing where data is stored in memory, they profoundly affect the speed of the rest of your program, in ways that vary from program to program. More memory fragmentation, for example, can increase how many TLB misses your program runs into. I don't know what to do about that except try to benchmark some real programs as well as simple nested loops.
In cases where you're trying to measure an effect that's roughly the size of your measurement precision, statistical tests can be useful. https://github.com/c-blake/bu/blob/main/doc/edplot.md linked by cb321 above discusses using the 2-sample Kolmogorov-Smirnov test, which is not nearly as terrifying as it sounds. But most effects will be either much larger than your measurement error, in which case the result is obvious enough that you don't need the statistics, or much smaller, in which case all they'll tell you is that you can't be sure about the direction of the effect, much less its magnitude. In medicine, where your experiments take years and people die in them, using sophisticated statistics to make the most of your precious data is very important. In performance benchmarking, where your experiments take milliseconds, it's usually better to improve your measurement precision instead. Sometimes just running the same benchmark more times is enough, but there are more powerful techniques available.
A way to get higher precision than `time` can give you is to run the program under `valgrind` (on Linux). Even raw valgrind will tell you exactly how many instructions it ran, and in most cases that number is the same from run to run, so instead of a measurement with ±3% precision like `time` usually gives you when you've controlled everything properly, you get a measurement typically with ±0.00000001% precision. This is fantastic, far beyond most things you can do in a physics, electronics, or chemistry lab. It means that even with just two runs you can tell if your optimization affected that number by even 0.0000001%, or, more interestingly, 0.1%. And it isn't affected by things like which CPU you run it on or whether your laptop is unplugged, and it only slows the program down about 10×.
Unfortunately, it measures the wrong thing, because a program that runs more instructions can still be faster. `valgrind --tool=cachegrind` tries to simulate your CPU's cache hierarchy to give you a better idea of how long the program will actually take; the `kcachegrind` GUI can give you a "cycle estimation" number based on the numbers it spits out. But it can still be unrelated to how fast the program actually runs for a variety of reasons; system calls are the biggest one. `kcachegrind` is also a "profiler": it shows you how much time (simulated instructions, simulated cache misses, cycle estimation, whatever) is used by each subroutine in your program.
You might wonder why an optimization that speeds your program up by 0.1% matters. The answer is that if you do 106 such optimizations you have sped your program up by 10%. SQLite has improved its performance by more than a factor of 3 over a 10-year period by using this approach: https://www.sqlite.org/cpu.html
There's a more basic profiler called gprof built into GCC; building your program with `-pg` adds profiling instrumentation to it, so that it writes profiling information into a file called `gmon.out`. After running a program built with `-pg`, you can run `gprof testprogram` to analyze `gmon.out`; it will tell you how much time each subroutine took (both including and excluding its transitive callees) and how many times it was called.
Linux also has a built-in profiler called `perf_events` https://perfwiki.github.io/main/ which uses hardware performance counters to get numbers like cachegrind's, but using the real hardware instead of simulation, and including a lot more detail; it can also profile the time your program spends in system calls. I've only used it in very basic ways, so I don't know very much about it. Intel also has a proprietary thing called VTune which can do things perf can't and runs on OSes that aren't Linux, but doesn't support AMD's processors.
— ⁂ —
I hope this is of some value to you! I'd be happy to answer any other questions. Hopefully correctly.
8dcc|1 year ago
> The actual benchmark program itself can be as simple as allocating and deallocating in a loop [...]
What do you mean, exactly? Deallocating immediately after allocating? I feel like there would be a big difference between that and allocating multiple blocks before freeing them.