Reliable JavaScript benchmarking with Tachometer

Writing good benchmarks is hard. Even if you grasp the basics of performance timings and measurements, it’s easy to fool yourself:

  • You weren’t measuring what you thought you were measuring.
  • You got the answer you wanted, so you stopped looking.
  • You didn’t clean state between tests, so you were just measuring the cache.
  • You didn’t understand how JavaScript engines work, so it optimized away your code.

Et cetera, et cetera. Anyone who’s been doing web performance long enough probably has a story about how they wrote a benchmark that gave them some satisfying result, only to realize later that they flubbed some tiny detail that invalidated the whole thing. It can be crushing.

For years now, though, I’ve been using Tachometer for most browser-based benchmarks. It’s come up on this blog a few times, although I’ve never written about it specifically.

Tachometer doesn’t make benchmarking totally foolproof, but it does automate a lot of the trickiest bits. What I like best is that it:

  1. Runs iterations until it reaches statistical significance.
  2. Alternates between two scenarios in an interleaved fashion.
  3. Launches a fresh browser profile between each iteration.

To be concrete, let’s say you have two scenarios you want to test: A and B. For Tachometer, these can simply be two web pages:

<!-- a.html -->
<script>
  scenarioA()
</script>

<!-- b.html -->
<script>
  scenarioB()
</script>

The only requirement is that you have something to measure, e.g. a performance measure:

function scenarioA() { // or B
  performance.mark('start')
  doStuff()
  performance.measure('total', 'start')
}
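
To wire this up, you point Tachometer at the two pages in a config file and tell it which performance entry to read. Saved as, say, tachometer.json (the file name is arbitrary), it looks roughly like this (the exact field names, like the "performance" measurement mode and entryName, are how I remember the schema, so double-check against the docs):

{
  "sampleSize": 50,
  "benchmarks": [
    {
      "name": "scenario-a",
      "url": "a.html",
      "measurement": { "mode": "performance", "entryName": "total" }
    },
    {
      "name": "scenario-b",
      "url": "b.html",
      "measurement": { "mode": "performance", "entryName": "total" }
    }
  ]
}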

Now let’s say you want to know whether scenarioA or scenarioB is faster. Tachometer will essentially do the following:

  1. Load a.html.
  2. Load b.html.
  3. Repeat steps 1-2 until it reaches statistical confidence about whether A and B differ by more than a given threshold (e.g. 1%, 10%, etc.).
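
Concretely, with the config above saved as tachometer.json, that whole loop is a single command (if I remember right, the binary is also installed under the shorter name tach):

npx tachometer --config tachometer.json

When it’s done, it prints a table with the mean for each scenario and a confidence interval for the difference between them.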

This has several nice properties:

  1. Environment-specific variance mostly cancels out. Since you’re running A and B back to back on the same machine, in an interleaved fashion, it’s very unlikely that some environmental quirk will make one look artificially faster or slower than the other.
  2. You don’t have to guess how many iterations are “enough” – the statistical test does that for you.
  3. The browser is fresh between each iteration, so you’re not just measuring cached/JITed performance.

That said, there are several downsides to this approach:

  1. The less different A and B are, the longer the tool will take to tell you that there’s no difference. In a CI environment, this is basically a nonstarter, because 99% of PRs don’t affect performance, and you don’t want to spend hours of CI time just to find out that updating the README didn’t regress anything.
  2. As mentioned, you’re not measuring JITed time. Sometimes that’s exactly what you want to measure, though. In that case, you have to run your own iterations-within-iterations (e.g. a for-loop, as in the sketch after this list) to avoid measuring the pre-JITed time.
  3. …Which you may end up doing anyway, because the tool tends to work best when your iterations take a large enough chunk of time. In my experience, a minimum of 50ms is preferable, although throttling can help you get there.
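
For the JIT and 50ms cases, the loop-within-the-iteration approach usually ends up looking something like this (doStuff and the iteration counts are placeholders you’d tune for your own workload):

function scenarioA() { // or scenarioB() in b.html
  // Warm-up pass: give the engine a chance to JIT-compile doStuff()
  // before the timed section starts.
  for (let i = 0; i < 1000; i++) {
    doStuff()
  }
  // Timed batch: large enough that the total comfortably clears ~50ms.
  performance.mark('start')
  for (let i = 0; i < 10000; i++) {
    doStuff()
  }
  performance.measure('total', 'start')
}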

Tachometer also lacks any kind of visualization of performance changes over time, although there is a good GitHub Action that can report the difference between a PR and your main branch. (Then again, you might not want to run it on every PR.)

The way I typically use Tachometer is to run one-off benchmarks, for instance when I’m doing some kind of cross-browser performance analysis. I often generate the config files rather than hand-authoring them, since they can be a bit repetitive.
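
As an example of what I mean by generating configs, a small Node script along these lines does the trick (the browser list, file names, and config fields are all just illustrative):

// generate-configs.mjs: write one Tachometer config per browser
import fs from 'node:fs'

const browsers = ['chrome', 'firefox']
for (const browser of browsers) {
  const config = {
    sampleSize: 50,
    benchmarks: ['a.html', 'b.html'].map((url) => ({
      name: `${browser} ${url}`,
      url,
      browser: { name: browser },
      measurement: { mode: 'performance', entryName: 'total' }
    }))
  }
  fs.writeFileSync(`tachometer-${browser}.json`, JSON.stringify(config, null, 2))
}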

Also, I only use Tachometer for browser or JavaScript tests – for anything else, I’d probably look into Hyperfine. (Haven’t used it, but heard good things about it.)

Tachometer has served me well for a long time, and I’m a big fan. The main benefit I’ve found is just the consistency of its results. I’ve been surprised how many times it’s reliably flagged some odd regression (even in the ~1% range), which I could then track down with a git bisect to find the culprit. It’s also great for validating (or debunking) proposed perf optimizations.

Like any tool, Tachometer definitely has its flaws, and it’s still possible to fool yourself with it. But until I find a better tool, this is my go-to for low-level JavaScript microbenchmarks.

Bonus tip: running Tachometer in --manual mode with the DevTools performance tab is a great way to validate that you’re measuring what you think you’re measuring. Pay attention to the “User Timings” section to ensure your timings line up with what you expect.
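
For reference, that’s just the usual invocation with the flag tacked on (same hypothetical config file as above):

npx tachometer --config tachometer.json --manual

Instead of driving the browser itself, Tachometer serves the benchmark pages and prints URLs for you to open by hand, which is exactly what you want for poking around in the profiler.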
