IainIreland | 2 years ago
Benchmarking is hard. It is very easy to write a benchmark where improving your score does not improve real-world performance, and over time even a good benchmark will become less useful as the important improvements are all made. This V8 blog post about Octane is a good description of some of the issues: https://v8.dev/blog/retiring-octane
Speedometer 3, in my experience, is the least bad browser benchmark. It hits code that we know from independent evidence is important for real-world performance. We've been targeting our performance work at Speedometer 3 for the last year, and we've seen good results.
My favourite example: a few years ago, we decided that initial pageload performance was our performance priority for the year, and we spent some time trying to optimize for that. Speedometer 3 is not primarily a pageload benchmark. Nevertheless, our pageload telemetry improved more from targeting Speedometer 3 than it did when we were deliberately targeting pageload. (See the pretty graphs here: https://hacks.mozilla.org/2023/10/down-and-to-the-right-fire...)
This is the advantage of having a good benchmark: it speeds up the iterative cycle of identifying a potential issue, writing a patch, and evaluating the results.
lapcat|2 years ago
21 is apparently better than 20, but how much better? You could say "1 better", tautologically, but how does that relate to the real world?
Driving a car 1 mile per hour faster may be better, in a sense, but even if you drove 24 hours straight, it would only gain you 24 total miles, which is almost negligible on such a long trip. Nobody would be impressed by that difference.
charcircuit|2 years ago
Vinnl|2 years ago
> "The score is a rescaled version of inverse time" is the key here.
> If you run all the tests in half the time, your Speedometer score will double. If your score improves by 1%, it implies that you are 1% faster on the subtests.
> (There are probably some subtleties here because we're using the geometric mean to avoid putting too much weight on any individual subtest, but the rough intuition should still hold.)
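To make the "rescaled inverse time" intuition concrete, here is a rough sketch. It is not Speedometer's actual scoring code: the `score` function, the `scale` constant, and the subtest times are all made up for illustration. It only assumes what the quote says, that the score is proportional to the inverse of the geometric mean of the subtest times.

```python
import math

def score(times_ms, scale=1000.0):
    """Hypothetical score: a constant divided by the geometric mean of subtest times.

    `scale` is an arbitrary illustrative constant, not a real Speedometer value.
    """
    geo_mean = math.exp(sum(math.log(t) for t in times_ms) / len(times_ms))
    return scale / geo_mean

# Made-up subtest times in milliseconds.
baseline = [120.0, 45.0, 300.0]

# Every subtest runs in half the time.
all_halved = [t / 2 for t in baseline]
all_ratio = score(all_halved) / score(baseline)

# Only the first of the three subtests runs in half the time.
one_halved = [baseline[0] / 2] + baseline[1:]
one_ratio = score(one_halved) / score(baseline)

print(f"all subtests halved: score x{all_ratio:.3f}")
print(f"one subtest halved:  score x{one_ratio:.3f}")
```

Under these assumptions, halving every subtest doubles the score, while halving only one of three subtests multiplies it by the cube root of 2 (about 1.26). That is the geometric-mean subtlety: each subtest contributes equally in relative terms, no matter how long it takes in absolute terms.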
bigfudge|2 years ago