jammycrisp | 2 years ago
This is a neat approach! I'm curious how well it maps to actual perf degradations though. Valgrind models an old CPU with a more naive branch predictor. For low-level branch-y code (say a native JSON parser), I'd be curious how well valgrind's simulated numbers map to real world measurements?
My probably naive intuition guesses that some low-level branchy code that valgrind thinks may be slower may run fine on a modern CPU (better branch predictor, deeper cache hierarchy). I'd expect false negatives to be rarer though - if valgrind thinks it's faster it probably is? What's your experience been like here?
art049 | 2 years ago
We didn't try it on this specific case, but we've found that on branchy code valgrind does exaggerate the branching cost. We could probably mitigate this by collecting more branch-related data and incorporating it into our reported numbers, so they map more accurately to reality and to more recent CPUs.
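For reference, valgrind's cachegrind tool can already report the simulated branch counts directly via its branch-simulation flag, which is one way to see how much of a workload's simulated cost is branch mispredicts. A minimal sketch (the benchmark binary name here is hypothetical; flags are from cachegrind's documentation):

```shell
# Run a benchmark binary under cachegrind with branch simulation enabled.
# --branch-sim=yes adds conditional-branch (Bc/Bcm) and indirect-branch
# (Bi/Bim) counts alongside the usual instruction and cache counters.
valgrind --tool=cachegrind --branch-sim=yes ./json_parser_bench

# Annotate the output to see per-function counts, including the
# mispredict columns (Bcm, Bim), and spot branch-heavy hot spots.
cg_annotate cachegrind.out.<pid>
```

Comparing the `Bcm`/`Bim` columns against hardware counters (e.g. from `perf stat`) on a modern CPU would show how far the simulated predictor diverges from real hardware.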
>My probably naive intuition guesses that some low-level branchy code that valgrind thinks may be slower may run fine on a modern CPU (better branch predictor, deeper cache hierarchy). I'd expect false negatives to be rarer though - if valgrind thinks it's faster it probably is? What's your experience been like here?
Totally! We haven't encountered a false positive in the reports yet. But as you mentioned, since valgrind models an old CPU, it's likely to happen eventually. Even though the simulated cache is quite old, it still improves the relevance of our measurements. When we have some time, we'd really enjoy refreshing valgrind's cache simulation, since it would probably eliminate some edge cases and reflect memory accesses more accurately.